Globus Online and the Science DMZ: Scalable Research Data Management Infrastructure for HPC Facilities Steve Tuecke, Raj Kettimuthu, and Vas Vasiliadis University of Chicago and Argonne National Laboratory Eli Dart ESnet SC13 – Denver, CO November 17, 2013 What if the research work flow could be managed as easily as… …our pictures …our e-mail …home entertainment 2 What makes these services great? Great User Experience + Scalable (but invisible) infrastructure 3 We aspire to create a great user experience for research data management What would a “dropbox for science” look like? 4 Managing data should be easy … Registry Staging Store Ingest Store Community Store Analysis Store Archive Mirror 5 … but it’s hard and frustrating! ! ! Staging Store Permission denied Registry Expired Ingest credentials Store ! Community Store ! Failed Server / Net Analysis Store NATs Firewalls Archive Mirror 6 Usability AND Performance • Three things are all required for data transfer tools 1. 2. 3. Reliability Consistency Performance • This is an ordered list! • • Q: How do we get a good toolset deployed on a reliable platform that consistently achieves high performance? A: Architecture à Foundation à Toolset • First some background, then we’ll define that architecture 7 Long Distance Data Transfers • Long distance data movement is different from other HPC center services - HPC facility only controls one end of the end to end path - Distances are measured in milliseconds, not nanoseconds - Software tools are different • As data sets grow (GB à TB à PB à EB à …) performance requirements increase - Human time scale does not change with data scale - Automation requirements increase with data scale • Default configurations are no longer sufficient - Networks, systems, storage and tools must be built and deployed properly (you wouldn’t deploy a supercomputer without tuning it) - Architecture must account for governing dynamics of data transfer applications Required Components • Long distance data services require several things - High performance networks Sufficient capacity No packet loss - Well-configured systems at the ends - Tools that scale • Components must be assembled correctly - Ad-hoc deployments seldom work out well - Details matter • If we understand the constraints and work within them, we are more likely to be consistently successful Boundaries And Constraints • Technology changes quickly – people and protocols change slowly - Increases in data, no change in people - Better tools, but using the same underlying protocols • Tools must be easy for people to use - A scientist’s time is very valuable - Computers are good at certain details Management of data transfers Failure recovery Manipulation of thousands of files - Tools must easily allow scientists to scale themselves up • Protocols are hard to change - The Internet protocol suite is what we have to work with - Make sure data transfer architecture accounts for protocol behavior TCP Background • Networks provide connectivity between hosts – how do hosts see the network? 
- From an application’s perspective, the interface to “the other end” is a socket - Communication is between applications – mostly over TCP • TCP – the fragile workhorse - TCP is (for very good reasons) timid – packet loss is interpreted as congestion - Packet loss in conjunction with latency is a performance killer - Like it or not, TCP is used for the vast majority of data transfer applications 11 TCP Background (2) • It is far easier to architect the network to support TCP than it is to fix TCP - People have been trying to fix TCP for years – limited success - Here we are – packet loss is still the number one performance killer in long distance high performance environments • Pragmatically speaking, we must accommodate TCP - Implications for equipment selection Provide loss-free service to TCP Equipment must be able to accurately account for packets - Implications for network architecture, deployment models Infrastructure must be designed to allow easy troubleshooting Test and measurement tools are critical 12 Soft Failures • An automatic network test alerted ESnet to reduced/poor performance on some of our network high latency paths • Routers did not detect any issues - perfSONAR/specialized performance software detected the problem • PerfSONAR measured a 0.0046% packet loss - 1 packet in 22000 packets • Performance impact of this: (outbound/inbound) - To/from test host 1 ms RTT : 7.3 Gbps out / 9.8 Gbps in - To/from test host 11 ms RTT: 1.2 Gbps out / 9.5 Gbps in - To/from test host 51ms RTT: 800 Mbps out / 9 Gbps in - To/from test host 88 ms RTT: 500 Mbps out / 8 Gbps in CONCLUSION: Data transfer across the US was more than 16 times slower than it should have been. Science is impacted, researchers get frustrated. 13 A small amount of packet loss makes a huge difference in TCP performance 14 How Do We Accommodate TCP? • High-performance wide area TCP flows must get loss-free service - Sufficient bandwidth to avoid congestion - Deep enough buffers in routers and switches to handle bursts Especially true for long-distance flows due to packet behavior No, this isn’t buffer bloat • Equally important – the infrastructure must be verifiable so that clean service can be provided - Stuff breaks Hardware, software, optics, bugs, … How do we deal with it in a production environment? 
- Must be able to prove a network device or path is functioning correctly Accurate counters must exist and be accessible Need ability to run tests – perfSONAR - Small footprint is a huge win – small number of devices so that problem isolation is tractable The Data Transfer Trifecta: The “Science DMZ” Model Dedicated Systems for Data Transfer Data Transfer Node • • • High performance Configured for data transfer Proper tools, such as Globus Online Network Architecture Science DMZ • • • • Dedicated location for DTN Proper security Easy to deploy - no need to redesign the whole network Additional info: http://fasterdata.es.net/ Performance Testing & Measurement perfSONAR • • • Enables fault isolation Verify correct operation Widely deployed in ESnet and other networks, as well as sites and facilities Science DMZ Takes Many Forms • There are a lot of ways to combine these things – it all depends on what you need to do - Small installation for a project or two - Facility inside a larger institution - Institutional capability serving multiple departments/divisions - Science capability that consumes a majority of the infrastructure • Some of these are straightforward, others are less obvious • Key point of concentration: High-latency path for TCP 17 The Data Transfer Trifecta: The “Science DMZ” Model Dedicated Systems for Data Transfer Data Transfer Node • • • High performance Configured for data transfer Proper tools Network Architecture Science DMZ • • • • Dedicated location for DTN Proper security Easy to deploy - no need to redesign the whole network Additional info: http://fasterdata.es.net/ Performance Testing & Measurement perfSONAR • • • Enables fault isolation Verify correct operation Widely deployed in ESnet and other networks, as well as sites and facilities Small-scale or Prototype Science DMZ • Add-on to existing network infrastructure - All that is required is a port on the border router - Small footprint, pre-production commitment • Easy to experiment with components and technologies - DTN prototyping - perfSONAR testing • Limited scope makes security policy exceptions easy - Only allow traffic from partners - Add-on to production infrastructure – lower risk 19 Science DMZ – Basic Abstract Cartoon Border Router perfSONAR WAN 10G Enterprise Border Router/Firewall 10GE 10GE Site / Campus access to Science DMZ resources Clean, High-bandwidth WAN path perfSONAR 10GE Site / Campus LAN Science DMZ Switch/Router 10GE perfSONAR Per-service security policy control points High performance Data Transfer Node with high-speed storage Abstract Science DMZ Data Path Border Router perfSONAR WAN 10G Enterprise Border Router/Firewall 10GE 10GE Site / Campus access to Science DMZ resources Clean, High-bandwidth WAN path perfSONAR 10GE Site / Campus LAN Science DMZ Switch/Router 10GE perfSONAR Per-service security policy control points High Latency WAN Path High performance Data Transfer Node with high-speed storage Low Latency LAN Path Supercomputer Center Deployment • High-performance networking is assumed in this environment - Data flows between systems, between systems and storage, wide area, etc. - Global filesystem often ties resources together Portions of this may not run over Ethernet (e.g. 
IB) Implications for Data Transfer Nodes • “Science DMZ” may not look like a discrete entity here - By the time you get through interconnecting all the resources, you end up with most of the network in the Science DMZ - This is as it should be – the point is appropriate deployment of tools, configuration, policy control, etc. • Office networks can look like an afterthought, but they aren’t - Deployed with appropriate security controls - Office infrastructure need not be sized for science traffic 22 Supercomputer Center Border Router WAN Firewall Routed Offices perfSONAR Virtual Circuit perfSONAR Core Switch/Router Front end switch Front end switch perfSONAR Data Transfer Nodes Supercomputer Parallel Filesystem Supercomputer Center Border Router WAN Firewall Routed Offices perfSONAR Virtual Circuit perfSONAR Core Switch/Router Front end switch Front end switch perfSONAR Data Transfer Nodes High Latency WAN Path Supercomputer Low Latency LAN Path Parallel Filesystem High Latency VC Path Major Data Site Deployment • In some cases, large scale data service is the major driver - Huge volumes of data – ingest, export - Individual DTNs don’t exist here – data transfer clusters • Single-pipe deployments don’t work - Everything is parallel Networks (Nx10G LAGs, soon to be Nx100G) Hosts – data transfer clusters, no individual DTNs WAN connections – multiple entry, redundant equipment - Choke points (e.g. firewalls) cause problems 25 Data Site – Architecture Virtual Circuit VC Provider Edge Routers WAN Virtual Circuit perfSONAR Data Transfer Cluster Border Routers HA Firewalls VC perfSONAR Site/Campus LAN Data Service Switch Plane perfSONAR Data Site – Data Path Virtual Circuit VC Provider Edge Routers WAN Virtual Circuit perfSONAR Data Transfer Cluster Border Routers HA Firewalls VC perfSONAR Site/Campus LAN Data Service Switch Plane perfSONAR Common Threads • Two common threads exist in all these examples • Accommodation of TCP - Wide area portion of data transfers traverses purpose-built path - High performance devices that don’t drop packets • Ability to test and verify - When problems arise (and they always will), they can be solved if the infrastructure is built correctly - Small device count makes it easier to find issues - Multiple test and measurement hosts provide multiple views of the data path perfSONAR nodes at the site and in the WAN perfSONAR nodes at the remote site The Data Transfer Trifecta: The “Science DMZ” Model Dedicated Systems for Data Transfer Data Transfer Node • • • High performance Configured for data transfer Proper tools Network Architecture Science DMZ • • • • Dedicated location for DTN Proper security Easy to deploy - no need to redesign the whole network Additional info: http://fasterdata.es.net/ Performance Testing & Measurement perfSONAR • • • Enables fault isolation Verify correct operation Widely deployed in ESnet and other networks, as well as sites and facilities Test and Measurement – Keeping the Network Clean • The wide area network, the Science DMZ, and all its systems start out functioning perfectly • Eventually something is going to break - Networks and systems are built with many, many components - Sometimes things just break – this is why we buy support contracts • Other problems arise as well – bugs, mistakes, whatever • We must be able to find and fix problems when they occur • Why is this so important? Because we use TCP! 
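To make the loss/latency interaction concrete, the sketch below applies the classic Mathis et al. steady-state bound, throughput <= MSS / (RTT * sqrt(loss)), to the loss rate from the perfSONAR soft-failure example above. The MSS value and the single-flow framing are assumptions for illustration; absolute numbers depend on the congestion control algorithm, buffer sizes, and the number of parallel streams, but the way a fixed loss rate penalizes long-RTT paths is the point.

# Illustrative only: single-stream TCP ceiling from the Mathis bound.
# MSS and loss rate below are assumed values mirroring the example above.
mss=1460          # bytes (standard 1500-byte MTU path)
loss=0.000046     # 0.0046% packet loss, as measured in the perfSONAR case
for rtt_ms in 1 11 51 88; do
  awk -v mss="$mss" -v p="$loss" -v rtt="$rtt_ms" 'BEGIN {
    bps = (mss * 8) / ((rtt / 1000) * sqrt(p))   # bits per second
    printf "RTT %2d ms -> roughly %6.0f Mbps per flow\n", rtt, bps / 1e6
  }'
done

The absolute values come out lower than the measured rates above (modern stacks and tuned hosts do better than this simple model), but the trend matches: the same tiny loss rate that is barely visible at 1 ms RTT caps a single flow at a small fraction of 10G at 88 ms.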
perfSONAR • All the network diagrams have little perfSONAR boxes everywhere - The reason for this is that consistent behavior requires correctness - Correctness requires the ability to find and fix problems You can’t fix what you can’t find You can’t find what you can’t see perfSONAR lets you see • Especially important when deploying high performance services - If there is a problem with the infrastructure, need to fix it - If the problem is not with your stuff, need to prove it Many players in an end to end path Ability to show correct behavior aids in problem localization 31 What is perfSONAR? • perfSONAR is a tool to: - Set network performance expectations - Find network problems (“soft failures”) - Help fix these problems • All in multi-domain environments - Widely deployed international capability - Easy for HPC centers to attach to Leverage international test and measurement Reduce troubleshooting overhead (no need for just-in-time tools deployment 32 perfSONAR - perfSONAR is a critical toolset for maintaining highperformance data transfer infrastructure - http://www.perfsonar.net/ - http://fasterdata.es.net/performance-testing/perfsonar/ - More information this afternoon - There is also a perfSONAR tutorial tomorrow Ensuring Network Performance with perfSONAR http://sc13.supercomputing.org/content/tutorials The Data Transfer Trifecta: The “Science DMZ” Model Dedicated Systems for Data Transfer Data Transfer Node • • • High performance Configured for data transfer Proper tools Network Architecture Science DMZ • • • • Dedicated location for DTN Proper security Easy to deploy - no need to redesign the whole network Additional info: http://fasterdata.es.net/ Performance Testing & Measurement perfSONAR • • • Enables fault isolation Verify correct operation Widely deployed in ESnet and other networks, as well as sites and facilities The Data Transfer Node • The Data Transfer Node is a high-performance host, typically a Linux PC server, that is optimized for a data mobility workload • A DTN server is made of several subsystems. Each needs to perform optimally for the DTN workflow: • Storage: capacity, performance, reliability, physical footprint • Networking: protocol support, optimization, reliability • Motherboard: I/O paths, PCIe subsystem, IPMI • DTN is dedicated to data transfer, and not doing data analysis/ manipulation • In HPC centers, DTNs typically mount a global filesystem - Best deployed in sets (e.g. 4-8 DTNs) - Globus Online can use multiple DTNs DTN Toolset • DTNs are built for wide area data movement - Entry point for external data - Well-configured for consistent performance - Need optimized tools • Remember from before – we need a powerful toolset with advanced features to handle data scale - - - - Easy user interface Management of data transfers Failure recovery Manipulation of thousands (or millions, or …) of files • Globus Online provides these features – ideal DTN toolset Globus Online: Software-as-a-Service for Research Data Management 37 What is Globus Online? 
Big data transfer and sharing… …with Dropbox-like simplicity… …directly from your own storage systems 38 Reliable, secure, high-performance file transfer and synchronization • “Fire-and-forget” transfers 2 Globus Online • Automatic fault recovery Data Source moves and syncs files Data Destination • Seamless security integration 1 User initiates transfer request 3 Globus Online notifies user 39 Simple, secure sharing off existing storage systems • Easily share large data with any user or group 2 • No cloud storage required 1 User A selects file(s) to share, selects user or group, and sets permissions 40 Globus Online tracks shared files; no need to move files to cloud storage! Data Source 3 User B logs in to Globus Online and accesses shared file Globus Online is SaaS • Web, command line, and REST interfaces • Reduced IT operational costs • New features automatically available • Consolidated support & troubleshooting • Easy to add your laptop, server, cluster, supercomputer, etc. with Globus Connect 41 Globus Online For HPC Centers • Globus Online is easy for users to use • How might a HPC center make large scale science data available via Globus Online? • The Science DMZ model provides a set of design patterns - High-performance foundation - Enable next-generation data services for science - Ideal platform for Globus Online deployment 42 Demonstration File transfer, sharing and group management 43 Exercise 1 Globus Online signup and configuration 44 Exercise 2 Globus Online file transfer and sharing 45 David Skinner, Shreyas Cholia, Jason Hick, Matt Andrews NERSC, LBNL NERSC Use Case Networked Data @ NERSC 2013 4PB JGI File Systems 5PB /project 250TB /global/homes 4PB /global/scratch IB Subnet 1 GPFS Servers DTNs Carver pNSDs Genepool IB Subnet 2 PDSF Hopper DVS Edison DVS 47 - WAN Data Transfer @ NERSC • NERSC hosts several PB of data in HPSS and Global FS • Working towards Globus Online Provider Plan • Data transfer working group - Peer with ESnet and other national labs to solve end-to-end data challenges • Advanced Networking Initiative with ANL/ESnet (100Gb Ethernet) • Data transfer nodes optimized for high performance WAN transfers • User-centric – User Services and data experts work with user to solve transfer challenges • Support for multiple interfaces - Globus Online - GridFTP - BBCP Very popular w/ users - Optimized SCP • Distribution point for scientific data – public data repositories and science gateways - QCD, ESG, DeepSky, PTF, MaterialsProject, Daya Bay, C20C, SNO etc. GO @ NERSC 2012-23 Trends 120 Sources 175 Destinations Bytes sent to NERSC, per unique endpoint 1e+11 1e+08 Total bytes transferred 1e+02 1e+05 1e+10 1e+07 1e+04 1e+01 Total bytes transferred 1e+13 1e+14 Bytes sent from NERSC, per unique endpoint 0 50 100 150 Sources 124 Users 4.5K Transfers 46M files, 8K dirs 0.5PB, 3K deletes 0 50 100 150 200 Destinations 134 Users 3K Transfers 22M files, 7K dirs 0.4PB We also have 100 Globus Connect users (not included above) Optimizing Inter-Site Data Movement • NERSC is a net importer of data. - Since 2007, we receive more bytes than transfer out to other sites. - Leadership in inter-site data transfers (Data Transfer Working Group). - Have been the an archive site for several valuable scientific projects: SNO project (70TB), 20th Century Reanalysis project (140TB) • NERSC needs production support for data movement. New features. • Partnering with ANL and ESnet to advance HPC network capabilities. - 100Gb, Magellan ANL and NERSC cloud research infrastructure. 
- Community of GO Providers? • Data Transfer Working Group - Coordination between stakeholders (ESnet, ORNL, ANL, LANL and LBNL/NERSC) • Data Transfer Nodes - GlobusOnline endpoints - Available to all NERSC users - Optimized for WAN transfers and regularly tested - Close to the NERSC border - hsi for HPSS-to-HPSS transfers (OLCF and NERSC) - gridFTP for system-to-system or system-to-HPSS transfers - bbcp too How is this important to Science? Use cases in data-centric science from projects hosted at NERSC Science Discovery & Data Analysis Astrophysics discover early nearby supernova • Palomar Transient Factory runs machine learning algorithms on ~300GB/night delivered by ESnet & the Science DMZ • Rare glimpse of a supernova within 11 hours of explosion, 20M light years away • Telescopes world-wide redirected within 1 hour Data systems essential to science success • GPFS /project file system mounted on resources center-wide, brings a broad range of resources to the data • Data Transfer Nodes and Science Gateway Nodes improve data acquisition, access and processing capabilities [Discovery images: 23 August, 24 August, 25 August] One of Science Magazine's Top 10 Breakthroughs of 2012 Discovery of θ13 weak mixing angle • The last and most elusive piece of a longstanding puzzle: How can neutrinos appear to vanish as they travel? • The answer – a new, large type of neutrino oscillation - Affords new understanding of fundamental physics - May help solve the riddle of matter-antimatter asymmetry in the universe. Detectors count antineutrinos near the Daya Bay nuclear reactor complex in China. By calculating how many would be seen if there were no oscillation and comparing to measurements, a 6.0% rate deficit provides clear evidence of the new transformation. Experiment Could Not Have Been Done Without NERSC and ESnet • PDSF for simulation and analysis • HPSS for archiving and ingesting data • ESnet for data transfer • NERSC Global File System & Science Gateways for distributing results • NERSC is the only US site where all raw, simulated, and derived data are analyzed and archived PI: Kam-Biu Luk (LBNL) BES Light Sources • Current and emerging needs in prompt and off-line computational analysis of light source data show bottlenecks. • 150TB / experiment [Plot: (Observed – Predicted) spot positions, entire run, one tile, ~180K images] • Bursty, real-time needs • Automated hands-off data migration • HTC analysis • Detector roadmaps go only up: System Name / Frame Rate (Hz) / Compressed image size (MB) / Data rate (MB/s): Pilatus 6M: 12 / 6 / 72; Pilatus 6M fast: 25 / 6 / 150; Pilatus3 6M: 100 / 6 / 600; Eiger 16M: 180 / 16 / 2880 • Science moving toward more pixels and larger systems (surfaces, clusters, membranes) • Need data interfaces!
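To put the detector table above in network terms, here is a back-of-the-envelope conversion of three of the listed frame rates and compressed image sizes into sustained line rates and hourly volumes (decimal units, no protocol overhead; purely illustrative):

# name  frames/s  MB/frame  -> MB/s, Gbit/s, TB/hour (decimal units)
while read name fps mb; do
  awk -v n="$name" -v f="$fps" -v s="$mb" 'BEGIN {
    r = f * s
    printf "%-15s %5.0f MB/s  ~%4.1f Gbit/s  ~%4.1f TB/hour\n", n, r, r*8/1000, r*3600/1e6
  }'
done <<'EOF'
Pilatus_6M      12  6
Pilatus3_6M     100 6
Eiger_16M       180 16
EOF

A single Eiger 16M at full rate is on the order of 23 Gbit/s sustained, more than two 10GE links before any filesystem or transfer-tool overhead, which is the sense in which detector roadmaps outrun a default campus network.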
Dectris Ltd: pixel array detectors Framework Compare Experiment Remote User Remote User Prompt Analysis Pipeline Science Gateway Beamline User HPC Compute Resources image reference Experiment metadata Data Transfer Node Data Pipeline HPC Storage and Archive Growing science drivers: single particle work, reconstruction, heterogeneous samples Giant mimivirus: XFEL single particle diffraction AFM Atomic force microscopy EM Cryo-electron microscope reconstruction EM Autocorrelation function 30000-images combined Phased imaging From X-ray diffraction Seibert(2011) Nature Rossmann(2011) Intervirology Carboxysome Meeting Challenges w/Globus Online • Well worn transfer paths usually work well but significant challenges connecting to new data customers - Significant admin effort on all sides - Pinpointing bottlenecks can be tricky especially when you don’t have control over the network - Socialize other sites to use DTN and DMZ model - Data transfer working group has been HUGE help, Globus Support w/ users. • Need constant monitoring. Small changes or upgrades to network can kill performance. • Last mile problem – how do you deal with bottlenecks on the end user workstation? • Usability issues – most of the knobs on transfer interfaces seem like black magic to the end user. IDEAS: • Globus Online has helped alleviate pain on the user side. Can we have a view for network sysadmins (detailed error reporting, diagnostics … )? • Central service to diagnose network performance issues? • Continuous centrally managed performance monitoring? NERSC & Globus Online Futures NERSC WAN Requirements: last 10 years Daily WAN traffic in/out of NERSC over the last decade points to Tb networking in 2016 NERSC Daily WAN Traffic since 2001 100000 GB/day 10000 1000 100 2000 2002 2004 2006 2008 2010 Year 2012 2014 2016 2018 Globus Online in 2016? • Yesterday: Explicit transfers using knowledge of tools, techniques and crafts. - Talk to the right person - Use the right system (tuned) - Use the right client software (GO, gridFTP, bbcp, …) - Have the right storage hardware - Manage the data (copies) • Today: - Hosted 3rd party, hands off transfers - Soon: HPSS integration, session limits, dashboards, EZ auth • Exascale: Cross-mounted storage systems at key sites and unattended data set transfer capability. - Automation, everywhere. Maybe local caches for performance? - Latency guaranteed and reliable - QOS for bandwidth guarantee - Site specific authentication and authorization issues solved 2016 BES Science Depends on ESnet • BES lightsources (synchrotrons and FELs) are experiencing an exponential increase of data. - 2009 65 TB/yr , 2011 312 TB/yr, 2013 1.9 PB /yr - To the degree that beamline science is bracketed by data analysis bottlenecks, these instruments will not deliver their full scientific value • Beamline scientists and users are realizing that end-station data storage and analysis is not scalable. BES communities are ready for science gateway approach. • Computing facilities form the ideal crucible for developing and testing the end-to-end system needed for current and next-generation photon science. 
• LBNL LDRDs in 2013, ASCR Data Postdocs Demonstration: Globus Online advanced file transfer features - Data synchronization - Endpoint management - Command line interface 62 Morning Session Recap 63 The Science DMZ Configuration and Troubleshooting Outline • Science DMZ – architectural components • Using perfSONAR • Problems With Firewalls • Data Transfer Nodes Science DMZ Background • The data mobility performance requirements for data intensive science are beyond what can typically be achieved using traditional methods - Default host configurations (TCP, filesystems, NICs) - Converged network architectures designed for commodity traffic - Conventional security tools and policies - Legacy data transfer tools (e.g. SCP) - Wait-for-trouble-ticket operational models for network performance Science DMZ Background (2) • The Science DMZ model describes a performance-based approach - Dedicated infrastructure for wide-area data transfer Well-configured data transfer hosts with modern tools Capable network devices High-performance data path which does not traverse commodity LAN - Proactive operational models that enable performance Well-deployed test and measurement tools (perfSONAR) Periodic testing to locate issues instead of waiting for users to complain - Security posture well-matched to high-performance science applications A small amount of packet loss makes a huge difference in TCP performance The Data Transfer Trifecta: The “Science DMZ” Model Dedicated Systems for Data Transfer Data Transfer Node • • • High performance Configured for data transfer Proper tools, such as Globus Online Network Architecture Science DMZ • • • • Dedicated location for DTN Proper security Easy to deploy - no need to redesign the whole network Additional info: http://fasterdata.es.net/ Performance Testing & Measurement perfSONAR • • • Enables fault isolation Verify correct operation Widely deployed in ESnet and other networks, as well as sites and facilities Abstract Science DMZ Border Router perfSONAR WAN 10G Enterprise Border Router/Firewall 10GE 10GE Site / Campus access to Science DMZ resources Clean, High-bandwidth WAN path perfSONAR 10GE Site / Campus LAN Science DMZ Switch/Router 10GE perfSONAR Per-service security policy control points High Latency WAN Path High performance Data Transfer Node with high-speed storage Low Latency LAN Path 70 Supercomputer Center Border Router WAN Firewall Routed Offices perfSONAR Virtual Circuit perfSONAR Core Switch/Router Front end switch Front end switch perfSONAR Data Transfer Nodes High Latency WAN Path Supercomputer Low Latency LAN Path Parallel Filesystem High Latency VC Path 71 Data Site Virtual Circuit VC Provider Edge Routers WAN Virtual Circuit perfSONAR Data Transfer Cluster Border Routers HA Firewalls VC perfSONAR Site/Campus LAN Data Service Switch Plane perfSONAR 72 Common Threads • Two common threads exist in all these examples • Accommodation of TCP - Wide area portion of data transfers traverses purpose-built path - High performance devices that don’t drop packets • Ability to test and verify - When problems arise (and they always will), they can be solved if the infrastructure is built correctly - Small device count makes it easier to find issues - Multiple test and measurement hosts provide multiple views of the data path perfSONAR nodes at the site and in the WAN perfSONAR nodes at the remote site The Data Transfer Trifecta: The “Science DMZ” Model Dedicated Systems for Data Transfer Data Transfer Node • • • High performance Configured for data 
transfer Proper tools Network Architecture Science DMZ • • • • Dedicated location for DTN Proper security Easy to deploy - no need to redesign the whole network Additional info: http://fasterdata.es.net/ Performance Testing & Measurement perfSONAR • • • Enables fault isolation Verify correct operation Widely deployed in ESnet and other networks, as well as sites and facilities 74 perfSONAR • All the network diagrams have little perfSONAR boxes everywhere - The reason for this is that consistent behavior requires correctness - Correctness requires the ability to find and fix problems You can’t fix what you can’t find You can’t find what you can’t see perfSONAR lets you see • Especially important when deploying high performance services - If there is a problem with the infrastructure, need to fix it - If the problem is not with your stuff, need to prove it Many players in an end to end path Ability to show correct behavior aids in problem localization 75 Regular perfSONAR Tests • We run regular tests to check for two things - TCP throughput – default is every 4 hours, ESnet runs every 6 - One way delay and packet loss • perfSONAR has mechanisms for managing regular testing between perfSONAR hosts - Statistics collection and archiving - Graphs - Dashboard display • This infrastructure is deployed now – perfSONAR hosts at facilities can take advantage of it • At-a-glance health check for data infrastructure 76 perfSONAR Dashboard Status at-a-glance • Packet loss • Throughput • Correctness Current live instance at http://ps-dashboard.es.net/ Drill-down capabilities • Test history between hosts • Ability to correlate with other events • Very valuable for fault localization 77 Throughput – Setting Expectations Throughput measurements • Periodic verification • More concrete than packet loss • Easier to show to users Current live instance at http://ps-dashboard.es.net/ Setting expectations is key • If the tester can’t get good performance, forget big data • If the tester works and the DTNs do not, you’ve learned something 78 Throughput Detail Graph • Temporary drop in performance was due to re-route around a fiber cut - Latency increase - Clean otherwise (performance stayed high) • Other than that, it’s stable, and performs well (over 2Gbps per stream) • This is a powerful tool for expectation management 79 Quick perfSONAR Demo • Look up and test to a remote perfSONAR server 80 perfSONAR Deployment Locations • Critical to deploy such that you can test with useful semantics • perfSONAR hosts allow parts of the path to be tested separately - Reduced visibility for devices between perfSONAR hosts - Must rely on counters or other means where perfSONAR can’t go • Effective test methodology derived from protocol behavior - TCP suffers much more from packet loss as latency increases - TCP is more likely to cause loss as latency increases - Testing should leverage this in two ways Design tests so that they are likely to fail if there is a problem Mimic the behavior of production traffic as much as possible - Note: don’t design your tests to succeed The point is not to “be green” even if there are problems The point is to find problems when they come up so that the problems are fixed quickly 81 WAN Test Methodology – Problem Isolation • Segment-to-segment testing is unlikely to be helpful - TCP dynamics will be different - Problem links can test clean over short distances - This also goes for testing from a Science DMZ to the border router or first provider perfSONAR host • Run long-distance tests - Run the 
longest clean test you can, then look for the shortest dirty test that includes the path of the clean test - In many cases, there is a problem between the two remote test locations • In order for this to work, the testers need to be already deployed when you start troubleshooting - ESnet has at least one perfSONAR host at each hub location. So does Internet2. So do many regional networks. - If your provider does not have perfSONAR deployed, ask them why – then ask when they will have it done 82 Wide Area Testing – Problem Statement Border perfSONAR Science DMZ perfSONAR Science DMZ perfSONAR Border perfSONAR perfSONAR perfSONAR perfSONAR perfSONAR 10GE 10GE 10GE 10GE Nx10GE 10GE National Labortory University Campus Poor Performance WAN 83 Wide Area Testing – Full Context Border perfSONAR Science DMZ perfSONAR Science DMZ perfSONAR Border perfSONAR perfSONAR perfSONAR perfSONAR perfSONAR 10GE 10GE 10GE 10GE Nx10GE Lab ~1 msec Nx10GE perfSONAR perfSONAR 10GE 100GE perfSONAR 10GE 10GE 10GE perfSONAR perfSONAR 10GE 10GE 100GE ESnet path ~30 msec 10GE 10GE Poor Performance perfSONAR 100GE Campus ~1 msec 10GE perfSONAR perfSONAR 10GE 100GE Regional Path ~2 msec 10GE 10GE 100GE Internet2 path ~15 msec 84 Wide Area Testing – Long Clean Test Border perfSONAR Science DMZ perfSONAR Science DMZ perfSONAR Border perfSONAR perfSONAR perfSONAR perfSONAR perfSONAR 10GE 10GE 10GE 10GE Nx10GE Lab ~1 msec Nx10GE perfSONAR 10GE 100GE perfSONAR perfSONAR 10GE 10GE 10GE 100GE ESnet path ~30 msec perfSONAR perfSONAR perfSONAR 10GE 100GE 10GE 10GE 48 msec perfSONAR 10GE 10GE Clean, Fast Clean, Fast perfSONAR 100GE Campus ~1 msec 10GE Regional Path ~2 msec 10GE 10GE 100GE Internet2 path ~15 msec 85 Poorly Performing Tests Illustrate Likely Problem Areas Dirty, Slow 49 msec Border perfSONAR Science DMZ perfSONAR Science DMZ perfSONAR Border perfSONAR perfSONAR perfSONAR perfSONAR perfSONAR 10GE Dirty, Slow 10GE Nx10GE Lab ~1 msec Nx10GE perfSONAR perfSONAR 10GE 10GE 10GE 100GE ESnet path ~30 msec perfSONAR perfSONAR perfSONAR 10GE 100GE 10GE 10GE 48 msec perfSONAR 10GE 10GE Clean, Fast Clean, Fast perfSONAR 100GE Campus ~1 msec 49 msec Clean, Fast perfSONAR 100GE 10GE 10GE Clean, Fast 10GE 10GE Regional Path ~2 msec 10GE 10GE 100GE Internet2 path ~15 msec 86 Lessons From The Test Case • This testing can be done quickly if perfSONAR is already deployed • Huge productivity - Reasonable hypothesis developed quickly - Probable administrative domain identified - Testing time can be short – an hour or so at most • Without perfSONAR cases like this are very challenging • Time to resolution measured in months (I’ve worked those too) • In order to be useful for data-intensive science, the network must be fixable quickly, because stuff breaks • The Science DMZ model allows high-performance use of the network, but perfSONAR is necessary to ensure the whole kit functions well 87 Science DMZ Security Model • Goal – disentangle security policy and enforcement for science flows from security for business systems • Rationale - Science flows are relatively simple from a security perspective - Narrow application set on Science DMZ Data transfer, data streaming packages No printers, document readers, web browsers, building control systems, staff desktops, etc. 
- Security controls that are typically implemented to protect business resources often cause performance problems 88 Performance Is A Core Requirement • Core information security principles - Confidentiality, Integrity, Availability (CIA) - These apply to systems as well as to information, and have farreaching effects Credentials for privileged access must typically be kept confidential Systems that are faulty or unreliable are not useful scientific tools Data access is sometimes restricted, e.g. embargo before publication Some data (e.g. medical data) has stringent requirements • In data-intensive science, performance is an additional core mission requirement - CIA principles are important, but if the performance isn’t there the science mission fails - This isn’t about “how much” security you have, but how the security is implemented - We need to be able to appropriately secure systems in a way that does not compromise performance or hinder the adoption of advanced services 89 Placement Outside the Firewall • The Science DMZ resources are placed outside the enterprise firewall for performance reasons - The meaning of this is specific – Science DMZ traffic does not traverse the firewall data plane - This has nothing to do with whether packet filtering is part of the security enforcement toolkit • Lots of heartburn over this, especially from the perspective of a conventional firewall manager - Lots of organizational policy directives mandating firewalls - Firewalls are designed to protect converged enterprise networks - Why would you put critical assets outside the firewall??? • The answer is that firewalls are typically a poor fit for highperformance science applications 90 Let’s Talk About Firewalls • A firewall’s job is to enhance security by blocking activity that might compromise security - This means that a firewall’s job is to prevent things from happening - Traditional firewall policy doctrine dictates a default-deny policy Find out what business you need to do Block everything else • Firewalls are typically designed for commodity or enterprise environments - This makes sense from the firewall designer’s perspective – lots of IT spending in commodity environments - Firewall design choices are well-matched to commodity traffic profile High device count, high user count, high concurrent flow count Low per-flow bandwidth Highly capable inspection and analysis of business applications 91 Thought Experiment • We’re going to do a thought experiment • Consider a network between three buildings – A, B, and C • This is supposedly a 10Gbps network end to end (look at the links on the buildings) • Building A houses the border router – not much goes on there except the external connectivity • Lots of work happens in building B – so much so that the processing is done with multiple processors to spread the load in an affordable way, and results are aggregated after • Building C is where we branch out to other buildings • Every link between buildings is 10Gbps – this is a 10Gbps network, right??? 
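Before looking at the diagrams, it helps to spell out the arithmetic the thought experiment turns on: aggregate capacity and per-flow capacity are not the same thing. The sketch below assumes a single 10 TB science dataset (a made-up but typical size) moved over one TCP flow:

# Time to move one 10 TB dataset over a single flow (decimal units,
# no protocol overhead; the dataset size is an assumption for illustration).
for gbps in 1 10; do
  awk -v r="$gbps" 'BEGIN {
    hours = (10 * 8e12) / (r * 1e9) / 3600
    printf "one flow at %2d Gbps: about %4.1f hours for 10 TB\n", r, hours
  }'
done

Ten parallel 1G processors still add up to 10 Gbps of aggregate capacity, but any single flow only ever sees one of them, so exactly the kind of traffic a Science DMZ exists for gets the 22-hour behavior rather than the 2-hour behavior.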
92 Notional 10G Network Between Buildings [Diagram: Building A houses the border router and WAN uplink, Building B carries traffic across many parallel 1G paths between its 10GE links, and Building C fans out to other buildings; every link between buildings is 10GE, with perfSONAR attached] 93 Clearly Not A 10Gbps Network • If you look at the inside of Building B, it is obvious from a network engineering perspective that this is not a 10Gbps network - Clearly the maximum per-flow data rate is 1Gbps, not 10Gbps - However, if you convert the buildings into network elements while keeping their internals intact, you get routers and firewalls - What firewall did the organization buy? What's inside it? - Those little 1G "switches" are firewall processors • This parallel firewall architecture has been in use for years - Slower processors are cheaper - Typically fine for a commodity traffic load - Therefore, this design is cost competitive and common 94 Notional 10G Network Between Devices [Diagram: the same layout redrawn as devices, with a border router, a firewall containing the parallel internal 1G processors, and an internal router, all joined by 10GE links] 95 Notional Network Logical Diagram [Diagram: Border Router, Border Firewall, and Internal Router in line on 10GE links, with perfSONAR] 96 Firewall Example • Totally protected campus, with a border firewall Performance Behind the Firewall • Blue = "Outbound", e.g. campus to remote location upload • Green = "Inbound", e.g. download from remote location What's Inside Your Firewall? • Vendor: "But wait – we don't do this anymore!" - It is true that vendors are working toward line-rate 10G firewalls, and some may even have them now - 10GE has been deployed in science environments for over 10 years - Firewall internals have only recently started to catch up with the 10G world - 100GE is being deployed now, 40Gbps host interfaces are available now - Firewalls are behind again • In general, IT shops want to get 5+ years out of a firewall purchase - This often means that the firewall is years behind the technology curve - Whatever you deploy now, that's the hardware feature set you get - When a new science project tries to deploy data-intensive resources, they get whatever feature set was purchased several years ago 99 Firewall Capabilities and Science Traffic • Firewalls have a lot of sophistication in an enterprise setting - Application layer protocol analysis (HTTP, POP, MSRPC, etc.) - Built-in VPN servers - User awareness • Data-intensive science flows don't match this profile - Common case – data on filesystem A needs to be on filesystem Z Data transfer tool verifies credentials over an encrypted channel Then open a socket or set of sockets, and send data until done (1TB, 10TB, 100TB, …) - One workflow can use 10% to 50% or more of a 10G network link • Do we have to use a firewall? 100 Firewalls As Access Lists • When you ask a firewall administrator to allow data transfers through the firewall, what do they ask for? - IP address of your host - IP address of the remote host - Port range - That looks like an ACL to me! • No special config for advanced protocol analysis – just address/port • Router ACLs are better than firewalls at address/port filtering - ACL capabilities are typically built into the router - Router ACLs typically do not drop traffic permitted by policy 101 Security Without Firewalls • Data intensive science traffic interacts poorly with firewalls • Does this mean we ignore security? NO!
- We must protect our systems - We just need to find a way to do security that does not prevent us from getting the science done • Key point – security policies and mechanisms that protect the Science DMZ should be implemented so that they do not compromise performance 102 If Not Firewalls, Then What? • Remember – the goal is to protect systems in a way that allows the science mission to succeed • I like something I heard at NERSC – paraphrasing: “Security controls should enhance the utility of science infrastructure.” • There are multiple ways to solve this – some are technical, and some are organizational/sociological • I’m not going to lie to you – this is harder than just putting up a firewall and closing your eyes 103 Other Technical Capabilities • Intrusion Detection Systems (IDS) - One example is Bro – http://bro-ids.org/ - Bro is high-performance and battle-tested Bro protects several high-performance national assets Bro can be scaled with clustering: http://www.bro-ids.org/documentation/cluster.html - Host-based IDS can run on the DTN • Netflow and IPFIX can provide intelligence, but not filtering • Openflow and SDN - Using Openflow to control access to a network-based service seems pretty obvious - There is clearly a hole in the ecosystem with the label “Openflow Firewall” – I really hope someone is working on this - Could significantly reduce the attack surface for any authenticated network service - This would only work if the Openflow device had a robust data plane 104 Other Technical Capabilities (2) • Aggressive access lists - More useful with project-specific DTNs - If the purpose of the DTN is to exchange data with a small set of remote collaborators, the ACL is pretty easy to write - Large-scale data distribution servers are hard to handle this way (but then, the firewall ruleset for such a service would be pretty open too) • Limitation of the application set - One of the reasons to limit the application set in the Science DMZ is to make it easier to protect - Keep desktop applications off the DTN (and watch for them anyway using logging, netflow, etc – take violations seriously) - This requires collaboration between people – networking, security, systems, and scientists 105 The Data Transfer Trifecta: The “Science DMZ” Model Dedicated Systems for Data Transfer Data Transfer Node • • • High performance Configured for data transfer Proper tools Network Architecture Science DMZ • • • • Dedicated location for DTN Proper security Easy to deploy - no need to redesign the whole network Additional info: http://fasterdata.es.net/ Performance Testing & Measurement perfSONAR • • • Enables fault isolation Verify correct operation Widely deployed in ESnet and other networks, as well as sites and facilities 106 The Data Transfer Node • The DTN is where it all comes together - Data comes in from the network - Data received by application (e.g. Globus Online) - Data written to storage • Dedicated systems are important - Tuning for DTNs is specific - Best to avoid engineering compromises for competing applications • Hardware selection and configuration are key - Motherboard architecture - Tuning: BIOS, firmware, drivers, networking, filesystem, application • Hats off to Eric Pouyoul 107 DTN Hardware Selection • The typical engineering trade-offs between cost, redundancy, performance, and so forth apply when deciding on what hardware to use for a DTN - e.g. redundant power supplies, SATA or SAS disks, cost/ performance for RAID controllers, quality of NICs, etc. 
• We recommend getting a system that can be expanded to meet future storage/bandwidth requirements. - Science data is growing faster than Moore's Law, so you might as well get a host that can support 40GE now, even if you only have a 10G network • Buying the right hardware is not enough. Extensive tuning is needed for a well-optimized DTN. 108 Low performance: SuperMicro X7DWT 109 High Performance Motherboard 110 Motherboard and Chassis selection • Full 40GE requires a PCI Express gen 3 (aka PCIe gen3) motherboard - this means you are limited to an Intel Sandy Bridge or Ivy Bridge host. • Other considerations are memory speed, number of PCI slots, extra cooling, and an adequate power supply. • Intel's Sandy/Ivy Bridge (Ivy Bridge is about 20% faster) CPU architecture provides these features, which benefit a DTN: - PCIe Gen3 support (up to 16 GB/sec) - Turbo Boost (up to 3.9 GHz for the i7) 111 PCI Bus Considerations • Make sure the motherboard you select has the right number of slots with the right number of lanes for your planned usage. For example: - 10GE NICs require an 8-lane PCIe-2 slot - 40G/QDR NICs require an 8-lane PCIe-3 slot - HotLava has a 6x10GE NIC that requires a 16-lane PCIe-2 slot - Most RAID controllers require an 8-lane PCIe-2 slot - Very high-end RAID controllers for SSD might require a 16-lane PCIe-2 slot or an 8-lane PCIe-3 slot - A high-end Fusion-io ioDrive requires a 16-lane PCIe-2 slot • Some motherboards to look at include the following: • SuperMicro X9DR3-F • Sample Dell Server (PowerEdge R320-R720) • Sample HP Server (ProLiant DL380p gen8 High Performance model) 112 Network Subsystem Selection • There is a huge performance difference between cheap and expensive 10G NICs. - You should not go cheap on the NIC, as a high-quality NIC is important for an optimized DTN host. • NIC features to look for include: - support for interrupt coalescing - support for MSI-X - TCP Offload Engine (TOE) - support for zero-copy protocols such as RDMA (RoCE or iWARP) • Note that many 10G and 40G NICs come in dual-port versions, but that does not mean that using both ports at the same time gives you double the performance. Often the second port is meant to be used as a backup port. - True 2x10G-capable cards include the Myricom 10G-PCIE2-8C2-2S and the Mellanox MCX312A-XCBT. 113 Tuning • Defaults are usually not appropriate for performance. • What needs to be tuned: - BIOS - Firmware - Device Drivers - Networking - File System - Application 114 Tuning Methodology • At any given time, one element in the system is preventing it from going faster. • Step 1: Unit tuning: focusing on the DTN workflow, experiment and adjust element by element until reaching the maximum bare-metal performance. • Step 2: Run the application (all elements are used) and refine tuning until reaching the goal performance. • A more thorough treatment of tuning is at the end of this deck (this is a 30-60 minute talk in itself!) • Tuning your DTN host is extremely important. We have seen the overall I/O throughput of a DTN more than double with proper tuning. • Tuning can be as much art as science. Due to differences in hardware, it's hard to give concrete advice; an illustrative example of typical host network settings follows below.
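As a concrete but deliberately generic example of the "Networking" item above, the settings below are the kind of Linux host tuning typically applied to a 10GE DTN. The specific values and the interface name are illustrative assumptions only; the fasterdata.es.net pages referenced on the next slide carry the maintained recommendations for current kernels and NICs.

# Illustrative sysctl settings for a 10GE DTN (example values only; run as root)
cat >> /etc/sysctl.conf <<'EOF'
# allow large TCP buffers so one flow can fill a long, high-bandwidth path
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
net.ipv4.tcp_rmem = 4096 87380 33554432
net.ipv4.tcp_wmem = 4096 65536 33554432
# a congestion control algorithm that recovers faster on high-RTT paths (if available)
net.ipv4.tcp_congestion_control = htcp
EOF
sysctl -p

# longer transmit queue and jumbo frames on the Science DMZ-facing NIC
# (eth2 is a placeholder interface name)
ip link set dev eth2 txqueuelen 10000
ip link set dev eth2 mtu 9000

Storage, driver, and BIOS tuning follow the same pattern as the two-step methodology above: change one element, measure, and re-measure under the real DTN workload.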
115 More DTN Info • Tuning: • http://fasterdata.es.net/science-dmz/DTN/tuning/ • Reference design: • http://fasterdata.es.net/science-dmz/DTN/reference-implementation/ 116 Globus Connect Multiuser for providers Deliver advanced data management services to researchers Provide an integrated user experience Reduce your support burden 117 Globus Connect Multiuser Globus Connect Multiuser MyProxy Online CA GridFTP Server Local Storage System (HPC cluster, campus server, …) Local system users • Create endpoint in minutes; no complex GridFTP install • Enable all users with local accounts to transfer files • Native packages: RPMs and DEBs • Also available as part of the Globus Toolkit 118 GCMU Demonstration 119 Hands-on Access The goal of this session is to show you (hands-on) how to take a resource and turn it into a Globus Online endpoint. Each of you is provided with an Amazon EC2 machine for this tutorial. Step 1 is to create a Globus Online account 120 Log into your host Your slip of paper has the host information. Log in as user "sc13": ssh [email protected] Use the password on the slip of paper. sc13 has passwordless sudo privileges. 121 Let's get started… On RPM distributions: $ yum install http://www.globus.org/ftppub/gt5/5.2/stable/installers/repo/Globus-5.2.stable-config.fedora-18-1.noarch.rpm $ yum install globus-connect-multiuser On Debian distributions: $ curl -LOs http://www.globus.org/ftppub/gt5/5.2/stable/installers/repo/globus-repository-5.2-stable-precise_0.0.3_all.deb $ sudo dpkg -i globus-repository-5.2-stable-precise_0.0.3_all.deb $ sudo aptitude update $ sudo aptitude -y install globus-connect-multiuser 122 GCMU Basic Configuration • Edit the GCMU configuration file: /etc/globus-connect-multiuser.conf – Set [Endpoint] Host = "myendpointname" • Run: globus-connect-multiuser-setup – Enter your Globus Online username and password when prompted • That's it!
You now have an endpoint ready for file transfers • More configurations at: support.globusonline.org/forums/22095911 123 Access endpoint on Globus Online • Now go to your account on www.globusonline.org • Go to Start Transfer • Access the endpoint you just created ‘<your Globus Online username>#sc13dtn’ 124 Globus Online–GCMU interaction Step 1 Access Endpoint Globus Online username Step 2 password (hosted service) TLS handshake Step 3 Step 5 username password Step 6 certificate Transfer request certificate Globus Connect Multiuser MyProxy Online CA Step 9 GridFTP Server PAM Step 4 Step 7 certificate Authentication & Data Transfer Step 8 Authorization Campus Cluster username password Access files certificate Local Authentication System (LDAP, RADIUS, Kerberos, …) 125 Local Storage GridFTP Server Remote Cluster / User’s PC MyProxy OAuth server • Web-based endpoint activation – Sites run a MyProxy OAuth server o MyProxy OAuth server in Globus Connect Multiuser – Users enter username/password only on site’s webpage to activate an endpoint – Globus Online gets short-term X.509 credential via OAuth protocol • MyProxy without Oauth – Site passwords flow through Globus Online to site MyProxy server – Globus Online does not store passwords – Still a security concern for some sites 126 Globus Connect Multiuser Access Endpoint Globus Online Step 1 Step 3 username password (hosted service) Step 2 Step 7 certificate redirect Transfer request certificate Globus Connect Multiuser Step 4 OAuth Server Step 8 username password MyProxy Online CA certificate Step 6 Step 11 GridFTP Server PAM Step 5 Step 9 Step 10 Authorization Campus Cluster username password Access files certificate Local Authentication System (LDAP, RADIUS, Kerberos, …) 127 Local Storage certificate Authentication & Data Transfer GridFTP Server Remote Cluster / User’s PC Try doing the following Create a file called tutorial.txt in /home/joe Go to the GO Web UI -> Start Transfer Select endpoint username#sc13dtn Activate the endpoint as user “joe” (not sc13). You should see joe's home directory. Transfer between this endpoint and your Globus Connect or any other endpoint you have access to. 
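For reference, here is a sketch of how the options used in this and the following exercises might sit together in /etc/globus-connect-multiuser.conf. The option names (Host, Public, Sharing, RestrictPaths, SharingRestrictPaths) are the ones these slides use; the endpoint name is a placeholder and the exact section placement may differ in your installed GCMU version, so treat this as an outline rather than a drop-in file.

[Endpoint]
# endpoint name registered with Globus Online (placeholder value)
Host = myendpointname
# let other Globus Online users see and access the endpoint (exercise below)
Public = True

[GridFTP]
# allow users to create shared endpoints (exercise below)
Sharing = True
# full access to the user's home directory, hide dotfiles
RestrictPaths = RW~,N~/.*
# only allow shares to be created under ~/public
SharingRestrictPaths = RW~/public

As elsewhere in this tutorial, re-run globus-connect-multiuser-setup after editing the file so the change takes effect.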
128 Making endpoint public Try to access the endpoint created by the person sitting next to you You will get the following message: ‘Could not find endpoint with name ‘xsede13dtn’ owned by user ‘<neighbor’s username>’ 129 Making endpoint public Go to your Globus Connect Multiuser server machine sudo vi /etc/globus-connect-multiuser.conf <-- uncomment “Public = False” and replace “False” with “True” Now when you access your neighbor’s endpoint, you will not get the message anymore, rather you will be prompted for username and password But you can not access it since you do not have an account on that machine 130 Enable sharing sudo vi /etc/globus-connect-multiuser.conf <-- uncomment Sharing = True Go to the GO Web UI -> Start Transfer Select endpoint username#xsede13dtn Share your home directory on this endpoint with neighbor Access your neighbor’s endpoint 131 Firewall configuration • Allow inbound connections to port 2811 (GridFTP control channel), 7512 (MyProxy CA), 443 (OAuth) • Allow inbound connections to ports 50000-51000 (GridFTP data channel) – If transfers to/from this machine will happen only from/ to a known set of endpoints (not common), you can restrict connections to this port range only from those machines • If your firewall restricts outbound connections – Allow outbound connections if the source port is in the range 50000-51000 132 Enable your resource. It’s easy. • Signup: globusonline.org/signup • Connect your system: globusonline.org/gcmu • Learn: support.globusonline.org/forums/22095911 • Need help? support.globusonline.org • Follow us: @globusonline 133 End of GCMU Basic Tutorial 134 GCMU Advanced Configuration • Customizing filesystem access • Using host certificates • Using CILogon certificates • Configuring multiple GridFTP servers • Setting up an anonymous endpoint 135 Path Restriction • Default configuration: – All paths allowed – Access control handled by the OS • Use RestrictPaths to customize – Specifies a comma separated list of full paths that clients may access – Each path may be prefixed by R (read) and/or W (write), or N (none) to explicitly deny access to a path – '~’ for authenticated user’s home directory, and * may be used for simple wildcard matching. • E.g. Full access to home directory, read access to /data: – RestrictPaths = RW~,R/data • E.g. Full access to home directory, deny hidden files: – RestrictPaths = RW~,N~/.* 136 Sharing Path Restriction • Define additional restrictions on which paths your users are allowed to create shared endpoint • Use SharingRestrictPaths to customize – Same syntax as RestrictPaths • E.g. Full access to home directory, deny hidden files: – RestrictPaths = RW~,N~/.* • E.g. Full access to public folder under home directory: – RestrictPaths = RW~/public • E.g. Full access to /project, read access to /scratch: – RestrictPaths = RW/project,R/scratch 137 Per user sharing control • Default: SharingFile = False – Sharing is enabled for all users when Sharing = True • SharingFile = True – Sharing is enabled only for users who have the file ~/.globus_sharing • SharingFile can be set to an existing path in order for sharing to be enabled – e.g. 
enable sharing for any user for which a file exists in /var/globusonline/sharing/: – SharingFile = "/var/globusonline/sharing/$USER" 138 Using a host certificate for GridFTP • Comment out – FetchCredentialFromRelay = True • Set – CertificateFile = <path_to_host_certificate> – KeyFile = <path_to_private_key_associated_with_host_certificate> – TrustedCertificateDirectory = <path_to_trust_roots> 139 Use CILogon-issued certificates for authentication • Your organization must allow CILogon to release the ePPN attribute in the certificate • Set AuthorizationMethod = CILogon in the Globus Connect Multiuser configuration • Set CILogonIdentityProvider = <your_institution_as_listed_in_CILogon_identity_provider_list> • Add the CILogon CA to your trust roots (/var/lib/globus-connect-multiuser/grid-security/certificates/) 140 Setting up additional GridFTP servers for your endpoint • curl -LOs http://www.globus.org/ftppub/gt5/5.2/stable/installers/repo/globus-repository-5.2-stable-oneiric_0.0.3_all.deb • sudo dpkg -i globus-repository-5.2-stable-oneiric_0.0.3_all.deb • sudo aptitude update • sudo aptitude -y install globus-connect-multiuser • sudo vi /etc/globus-connect-multiuser.conf <-- comment out 'Server = %(HOSTNAME)s' in the 'MyProxy Config' section • Copy the contents of '/var/lib/globus-connect-multiuser/grid-security/certificates/' from the first machine to the same location on this machine • sudo globus-connect-multiuser-setup <-- enter your Globus Online username and password 141 Setting up additional GridFTP servers for your endpoint $ curl -LOs http://www.globus.org/ftppub/gt5/5.2/stable/installers/repo/globus-repository-5.2-stable-precise_0.0.3_all.deb $ sudo dpkg -i globus-repository-5.2-stable-precise_0.0.3_all.deb $ sudo aptitude update $ sudo aptitude -y install globus-connect-multiuser $ sudo vi /etc/globus-connect-multiuser.conf • Comment out Server = %(HOSTNAME)s in the [MyProxy] section • Copy the contents of '/var/lib/globus-connect-multiuser/grid-security/certificates/' from the first machine to the same location on this machine $ sudo globus-connect-multiuser-setup • Enter your Globus Online username and password when prompted 142 Setting up an anonymous endpoint $ globus-gridftp-server -aa -anonymous-user <user> • -anonymous-user <user> is needed if run as root $ endpoint-add <name> -p ftp://<host>:<port> $ endpoint-modify --myproxy-server=myproxy.globusonline.org 143 End of Advanced GCMU Tutorial 144 Dataset Services Sharing Service Transfer Service Globus Nexus (Identity, Group, Profile) Globus Toolkit 145 Globus Connect … Globus Online APIs Globus Platform-as-a-Service Branded website integration 146 Integrating with Globus identity/group management services 147 Portal/application integration 148 Our research is supported by: U.S. DEPARTMENT OF ENERGY 149 We are a non-profit service provider to the non-profit research community 150 We are a non-profit service provider to the non-profit research community Our challenge: Sustainability 151 Globus Online End User Plans • Basic: Free – File transfer and synchronization to/from servers – Create private and public endpoints – Access to shared endpoints created by others • Plus: $7/month (or $70/year) – Create and manage shared endpoints – Peer-to-peer transfer and sharing 152 Globus Online Provider Plans Bundle of Plus subscriptions + Features for providers http://globusonline.org/provider-plans (Working on NET+ offering) 153 Move. Sync. Share. It's easy.
End of Advanced GCMU Tutorial 144 Globus Online services: Dataset Services, Sharing Service, Transfer Service, and Globus Nexus (Identity, Group, Profile), built on the Globus Toolkit 145 Globus Platform-as-a-Service: Globus Connect … Globus Online APIs … branded website integration 146 Integrating with Globus identity/group management services 147 Portal/application integration 148 Our research is supported by: U.S. DEPARTMENT OF ENERGY 149 We are a non-profit service provider to the non-profit research community 150 Our challenge: Sustainability 151 Globus Online End User Plans • Basic: Free – File transfer and synchronization to/from servers – Create private and public endpoints – Access to shared endpoints created by others • Plus: $7/month (or $70/year) – Create and manage shared endpoints – Peer-to-peer transfer and sharing 152 Globus Online Provider Plans Bundle of Plus subscriptions + features for providers http://globusonline.org/provider-plans (Working on a NET+ offering) 153 Move. Sync. Share. It's easy. • Signup: globusonline.org/signup • Connect your system: globusonline.org/globus_connect • Learn: support.globusonline.org/forums • Need help? support.globusonline.org • Follow us: @globusonline 154 NCSA Case Study @ SC2013 Jason Alt Q: How can we solve our data management needs as we jump to the next scale of computing? 156 Begin By Creating A Wish List • Submit and forget requests • Automate the transfers • Retry failed transfers • Load balance across available hardware • Let the user know when it's done Take the burden off the average user 157 How Had We Been Handling This? • Handful of TF clusters with a few PB of disk and less than 100 Gb of external networking • 5PB archive with SSH and FTP access • Password-less access from cluster to archive • Command-line based tool (mssftp/msscmd) • Automatic retry of transfers until complete • Site-specific syntax • Single node of operation 158 Are We Satisfying Our Wish List? • Submit and forget requests – NO • Automated transfers – YES • Failure retry – YES • Load balancing – NO • User notification – NO It certainly is not cool or fun to use • High admin overhead and user frustration 159 Q: Will these tools work with the next scale of computing? 160 Blue Waters: 11 Petaflop system; 28 x Dell R720 IE (import/export) nodes, each 2 x 2.1 GHz w/ 8 cores and 1 x 40GbE, connected over FDR IB (136 GB/s); 36 x Sonexion 6000 running Lustre 2.1: > 26.4 PB @ > 1 TB/s 161 HPSS: disk cache of 4 x DDN 12k, 2.4 PB @ 100 GB/s; 400 GB/s to/from the Blue Waters mover nodes: 50 x Dell R720, each 2 x 2.9 GHz w/ 8 cores and 2 x 40GbE (bonded); tape: 6 x Spectra Logic T-Finity, 12 robotic arms, 360 PB in 95,580 slots, 366 TS1140 Jaguar drives (85 GB/s) 162 Will We Satisfy Our Wish List? • Submit and forget requests – NO • Automated transfers – YES • Failure retry – YES • Load balancing – NO (now very necessary!) • User notification – NO Larger data sets mean even more frustration 163 Q: How can we solve our data management needs as we jump to the next scale of computing? 164 Leverage Existing Tools That Work • NCSA has a large GSI and GridFTP infrastructure • All of our clusters use GridFTP • Our legacy mass storage system uses GridFTP • All of our users are accustomed to using GridFTP • Redesign tools for very large data sets • Easy to use • Fault tolerant 165 Project Code Name Import/Export • We designed an automated transfer mechanism for Blue Waters • Support external and internal transfers • Support load balancing, retry on failure • Support CLI and eventually web • Requirements done, design almost done, almost time to code and then… Have you heard of Globus Online? 166 167 168 169 Will We Satisfy Our Wish List? • Submit and forget requests – YES • Automated transfers – YES • Failure retry – YES • Load balancing – YES • User notification – YES No coding necessary 170 Partnering With Globus Online • No need to create our own solution • Leverage solutions to problems from other sites • So many new features in the last year or more: • Flight control, round robin, increased concurrency • Blue Waters GO portal with OAuth • And much more in the pipeline • Regular expressions, namespace operations, intelligent staging from tape for archives 171 Q: How can we solve our data management needs as we jump to the next scale of computing? 172 A: Partnering with Globus Online has helped us achieve many of our data management goals.
173 NCSA Use Cases For Globus Online • Almost all transfers now use Globus Online • Blue Waters and iForge have adopted it as the supported transfer mechanism • Legacy read-only archive uses it as users move data off before it shuts down • The LSST project has begun to use it • Most transfers originate from the web interface • These use cases are really straightforward 174 Blue Waters Lustre Backups • 2PB of disk backed up nightly • Archive files created on Blue Waters and transferred to HPSS with Globus Online • Incremental backups complete in 4-6 hours • Written in Python using the GO RESTful API • Submits one archive per GO task – Makes individual archive completion easy to track – Fits with the design to parallelize backups 175 Missing Integration With Existing Workflows • Power users with existing workflows have been reluctant to change • Want Blue Waters to behave like other clusters • Don't want new steps for the file transfer component • NCSA set up GRAM and round-robin DNS to mimic GO-like functionality (lots of work!) • This use case loses load balancing & fault tolerance • More difficult for NCSA to monitor and assist • Need adoption from these science teams 176 Acknowledgements This research is part of the Blue Waters sustained petascale computing project, which is supported by the National Science Foundation (award number OCI 07-25070) and the state of Illinois. Blue Waters is a joint effort of the University of Illinois at Urbana-Champaign, its National Center for Supercomputing Applications, and the Great Lakes Consortium for Petascale Computation. 177 Recap of afternoon session and wrap-up Q&A 178 Extra slides 179 Lawrence Berkeley National Laboratory | U.S. Department of Energy | Office of Science Storage Architectures There are multiple options for DTN storage: local storage (RAID, SSD), a storage system attached over fiber, or a distributed file system over InfiniBand or Ethernet 180 Storage Subsystem Selection • Deciding what storage to use in your DTN is based on deciding what you are optimizing for: – performance, reliability, capacity, and cost • SATA disks historically have been cheaper and higher capacity, while SAS disks typically have been the fastest • However, these technologies have been converging, and with SATA 3.1 this is less true 181 SSD Storage: on its way to becoming mainstream • PCIe Gen3 removes the bottleneck preventing fast I/O • New product range: "prosumer" with interesting pricing • Larger choice of vendors and packaging (card vs. HDD form factor) • Better operating system support (both performance and reliability) • Published third-party benchmarks 182 SSD: Current State • Performance and reliability vary • Mid-range (consumer) SSD is affordable (roughly 10 times more expensive for 10 times the performance of HDD) • Two packaging options: HDD form factor or PCIe card • Specific tuning must be done for both reliability and performance 183 RAID Controllers • Often optimized for a given workload, rarely for performance • RAID0 is the fastest of all RAID levels but is also the least reliable • The performance of the RAID controller is a factor of the number of drives and its own processing engine 184
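Whichever mix of SATA, SAS, or SSD ends up behind the controller, it helps to get a rough baseline before tuning anything. A simple, non-destructive first check of raw sequential read speed is hdparm; the device name below is a placeholder, and it should be run against an idle disk:

$ sudo hdparm -t /dev/sdX

This only measures buffered sequential reads from a single device, so treat it as a sanity check rather than a benchmark; the dd and vmstat tests later in these slides exercise the full file system path.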
RAID Controllers (2) Be sure your RAID controller has the following: • 1GB of on-board cache • PCIe Gen3 support • A dual-core RAID-on-Chip (ROC) processor if you will have more than 8 drives Two examples of RAID cards that satisfy these criteria are the Areca ARC-1882i and the LSI 9271. When using SSDs, make sure the SSD brand and model is supported by the controller; otherwise, some optimizations will not work. 185 Tuning Defaults are usually not appropriate for performance. What needs to be tuned: • BIOS • Firmware • Device drivers • Networking • File system • Application 186 Tuning Methodology At any given time, one element in the system is preventing it from going faster. Step 1: Unit tuning: focusing on the DTN workflow, experiment and adjust element by element until reaching the maximum bare-metal performance. Step 2: Run the application (all elements are used) and refine tuning until reaching the goal performance. 187 DTN Tuning Tuning your DTN host is extremely important. We have seen the overall I/O throughput of a DTN more than double with proper tuning. Tuning can be as much art as science. Due to differences in hardware, it's hard to give concrete advice. Here are some tuning settings that we have found do make a difference. • This tutorial assumes you are running a Red Hat-based Linux system, but other types of Unix should have similar tuning knobs. Note that you should always use the most recent version of the OS, as performance optimizations for new hardware are added to every release. 188 BIOS Tuning For PCIe Gen3-based hosts, you should • enable "turbo boost", • disable hyperthreading and node interleaving. More information on BIOS tuning is described in this document: http://www.dellhpcsolutions.com/assets/pdfs/Optimal_BIOS_HPC_Dell_12G.v1.0.pdf 189 RAID Controller Different RAID controllers provide different tuning controls. Check the documentation for your controller and use the settings recommended to optimize for large-file reading. • You will usually want to disable any "smart" controller built-in options, as they are typically designed for different workflows. Here are some settings that we found increase performance on a 3ware RAID controller. These settings are in the BIOS, and can be entered by pressing Alt+3 when the system boots up. • Write cache – Enabled • Read cache – Enabled • Continue on Error – Disabled • Drive Queuing – Enabled • StorSave Profile – Performance • Auto-verify – Disabled • Rapid RAID recovery – Disabled 190 I/O tuning: vmstat, mpstat From the man page: reports information about processes, memory, paging, block I/O, traps, and CPU activity. • Shows true I/O operations • Shows CPU bottlenecks • Shows memory usage • Shows locks
$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r b swpd free buff cache si so bi bo in cs us sy id wa st
 0 0 0 0 0 22751248 192800 1017000 0 0 4 7 0 0 100 0 0
191
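The slide above shows vmstat; mpstat (from the sysstat package) gives the complementary per-core view, which is useful for spotting a single core pinned by interrupt handling while the others sit idle. A typical invocation while a transfer or dd test is running might be:

$ mpstat -P ALL 1

Watch the %irq, %soft, and %iowait columns for each CPU; a very lopsided distribution is a hint that interrupt affinity needs attention (see the interrupt binding slides at the end of this deck).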
Simulation & Benchmarking I/O: "dd test"
# dd of=/dev/null if=/storage/data1/test-file1 bs=4k &
# dd of=/dev/null if=/storage/data1/test-file2 bs=4k &
# dd of=/dev/null if=/storage/data2/test-file1 bs=4k &
# dd of=/dev/null if=/storage/data2/test-file2 bs=4k &
# dd of=/dev/null if=/storage/data3/test-file1 bs=4k &
# dd of=/dev/null if=/storage/data3/test-file2 bs=4k &
192 Example vmstat / dd: sample "vmstat 1" output captured while the six dd reads above were running, showing the sustained block-I/O rates (bi/bo columns) and the CPU and I/O-wait load during the test 193 I/O Scheduler The default Linux scheduler is the CFQ ("completely fair queuing") scheduler. For a DTN node, we recommend using the "deadline" scheduler instead. To enable deadline scheduling, add "elevator=deadline" to the end of the "kernel" line in your /boot/grub/grub.conf file, similar to this: kernel /vmlinuz-2.6.35.7 ro root=/dev/VolGroup00/LogVol00 rhgb quiet elevator=deadline 194 Block Device Tuning: readahead Increasing the amount of "readahead" usually helps on DTN nodes where the workload is mostly sequential reads. • However, you should test this, as some RAID controllers do this already, and changing it may have adverse effects. Setting readahead should be done at system boot time. For example, add something like this to /etc/rc.d/rc.local: /sbin/blockdev --setra 262144 /dev/sdb More information on readahead: http://www.kernel.org/doc/ols/2004/ols2004v2-pages-105-116.pdf 195 EXT4 Tuning The file system should be tuned to the physical layout of the drives. • Stride and stripe-width are used to align the volume according to the stripe size of the RAID. – stride is calculated as stripe (chunk) size / block size. – stripe-width is calculated as stride * number of disks providing capacity. Disabling journaling will also improve performance, but reduces reliability. Sample mkfs command: /sbin/mkfs.ext4 /dev/sdb1 -b 4096 -E stride=64,stripe-width=768 -O ^has_journal 196
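The stride and stripe-width in the sample command follow directly from the two formulas above. As a worked example (the geometry here is an assumption chosen to match the sample numbers): a RAID chunk size of 256 KiB with a 4 KiB file system block gives stride = 256 / 4 = 64, and 12 data-bearing disks give stripe-width = 64 * 12 = 768. After formatting, the values recorded in the superblock can be checked with tune2fs:

$ sudo tune2fs -l /dev/sdb1 | grep -iE 'stride|stripe'

Redo the calculation whenever the array geometry changes; a mismatched stripe-width quietly costs performance rather than failing outright.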
File System Tuning (cont.) There are also tuning settings that are done at mount time. Here are the ones that we have found improve DTN performance: • data=writeback – this option forces ext4 to use journaling only for metadata, which gives a huge improvement in write performance • inode_readahead_blks=64 – this specifies the number of inode blocks to be read ahead by ext4's readahead algorithm; the default is 32 • commit=300 – this parameter tells ext4 to sync its metadata and data every 300s. This reduces the reliability of data writes, but increases performance • noatime,nodiratime – these parameters tell ext4 not to write the file and directory access timestamps. Sample fstab entry: /dev/sdb1 /storage/data1 ext4 inode_readahead_blks=64,data=writeback,barrier=0,commit=300,noatime,nodiratime For more info see: http://kernel.org/doc/Documentation/filesystems/ext4.txt 197 SSD Issues Tuning your SSD is more about reliability and longevity than performance, as each flash memory cell has a finite lifespan that is determined by the number of "program and erase" (P/E) cycles. Without proper tuning, an SSD can die within months. • Never do "write" benchmarks on an SSD: this will damage your SSD quickly. Modern SSD drives and modern OSes should all include TRIM support. Only the newest RAID controllers include TRIM support (late 2012). Memory swap: to prolong SSD lifespan, do not swap on an SSD. In Linux you can control this using the sysctl variable vm.swappiness, e.g. add this to /etc/sysctl.conf: – vm.swappiness=1 This tells the kernel to avoid unmapping mapped pages whenever possible. Avoid frequent re-writing of files (for example when compiling code from source); use a ramdisk file system (tmpfs) for /tmp, /usr/tmp, etc. 198 EXT4 file system tuning for SSD These mount flags should be used for SSD partitions. • noatime: read accesses to the file system will no longer result in an update to the atime information associated with the file. This eliminates the need for the system to make writes to the file system for files which are simply being read. • discard: this enables the benefits of the TRIM command, as long as the kernel version is >= 2.6.33. Sample /etc/fstab entry: /dev/sda1 /home/data ext4 defaults,noatime,discard 0 1 For more information see: http://wiki.archlinux.org/index.php/Solid_State_Drives 199 Network Tuning
# add to /etc/sysctl.conf
net.core.rmem_max = 33554432
net.core.wmem_max = 33554432
net.ipv4.tcp_rmem = 4096 87380 33554432
net.ipv4.tcp_wmem = 4096 65536 33554432
net.core.netdev_max_backlog = 250000
# set default CC alg to htcp
net.ipv4.tcp_congestion_control = htcp
Add to /etc/rc.local:
# make sure cubic and htcp are loaded
/sbin/modprobe tcp_htcp
/sbin/modprobe tcp_cubic
# with the Myricom 10G NIC, increasing interrupt coalescing helps a lot:
/usr/sbin/ethtool -C ethN rx-usecs 75
# increase txqueuelen
/sbin/ifconfig eth2 txqueuelen 10000
/sbin/ifconfig eth3 txqueuelen 10000
And use Jumbo Frames! 200 NIC Tuning As the bandwidth of a NIC goes up (10G, 40G), it becomes critical to fine-tune the NIC. • Lots of interrupts per second (IRQs) • Network protocols require processing data (e.g. CRCs) and copying data to/from application space • The NIC may interfere with other components such as the RAID controller. Do not forget to tune TCP and other network parameters as described previously. 201
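Two quick checks help confirm that the settings above actually took effect; these are generic Linux commands rather than anything specific to one NIC, and the interface name is a placeholder:

$ sudo sysctl -p                              # reload /etc/sysctl.conf and print the applied values
$ sysctl net.ipv4.tcp_congestion_control      # should report htcp
$ sudo ifconfig eth2 mtu 9000                 # enable jumbo frames on the data interface
$ ip link show eth2                           # verify that mtu 9000 is set

Jumbo frames only help if every device in the path (switches, the far-end DTN) is configured for them, so coordinate the MTU change with your network group.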
Interrupt Affinity • Interrupt handlers are executed on a core • Depending on the scheduler, core 0 gets all the interrupts, or interrupts are dispatched in a round-robin fashion among the cores: both are bad for performance • Core 0 gets all interrupts: with very fast I/O, that core is overwhelmed and becomes a bottleneck • Round-robin dispatch: very likely the core that executes the interrupt handler will not have the code in its L1 cache • Two different I/O channels may end up on the same core 202 A simple solution: interrupt binding – Each interrupt is statically bound to a given core (network -> core 1, disk -> core 2) – Works well, but can become a headache and does not fully solve the problem: one very fast card can still overwhelm its core – You also need to bind the application to the same cores for best optimization: what about multi-threaded applications, for which we want one thread = one core? Linux by default runs "irqbalance", a daemon that automatically balances interrupts across the cores. It needs to be disabled if you bind interrupts manually. 203 Case study: Areca 1882 versus LSI 9271 • Both are PCIe 3.0 • Both use the same processor (dual-core ROC) • Similar hardware design However, the two cards perform very differently due to their firmware and drivers: while the Areca card seems to have better RAID tuning knobs, it does not allow more than one interrupt to be used for I/O. The LSI can balance its I/O across several interrupts, effectively dispatching the load onto several cores. Because of this, when using SSDs the LSI card delivers about three times more I/O performance. 204
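A rough sketch of how to inspect and pin interrupts on a Linux DTN, using the standard kernel interfaces (the IRQ number and CPU mask below are placeholders; check /proc/interrupts on your own host first):

$ grep eth2 /proc/interrupts                       # see which IRQs the NIC uses and how they spread across cores
$ sudo service irqbalance stop                     # stop the balancing daemon before binding manually
$ echo 2 | sudo tee /proc/irq/63/smp_affinity      # bind IRQ 63 to core 1 (hex CPU mask 0x2)

Many NIC and RAID vendors ship affinity scripts that apply this to all of a card's interrupt vectors at once; using those is usually less error-prone than editing smp_affinity by hand.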