Globus Online and the Science DMZ:
Scalable Research Data Management Infrastructure for HPC Facilities
Steve Tuecke, Raj Kettimuthu, and Vas Vasiliadis
University of Chicago and Argonne National Laboratory
Eli Dart
ESnet
SC13 – Denver, CO
November 17, 2013
What if the research workflow
could be managed as easily
as…
…our pictures
…our e-mail
…home entertainment
2
What makes these services
great?
Great User Experience
+
Scalable (but invisible)
infrastructure
3
We aspire to create a great user
experience for
research data management
What would a “dropbox for
science” look like?
4
Managing data should be easy …
[Diagram: data flows among a Registry, Staging Store, Ingest Store, Community Store, Analysis Store, Archive, and Mirror]
5
… but it’s hard and frustrating!
[Diagram: the same workflow, now littered with failures: “Permission denied”, “Expired credentials”, “Failed server / net”, NATs, firewalls]
6
Usability AND Performance
•  Three things are required of data transfer tools:
   1. Reliability
   2. Consistency
   3. Performance
•  This is an ordered list!
•  Q: How do we get a good toolset deployed on a reliable platform that consistently achieves high performance?
•  A: Architecture → Foundation → Toolset
•  First some background, then we’ll define that architecture
7
Long Distance Data Transfers
•  Long distance data movement is different from other HPC center
services
-  HPC facility only controls one end of the end to end path
-  Distances are measured in milliseconds, not nanoseconds
-  Software tools are different
•  As data sets grow (GB à TB à PB à EB à …) performance
requirements increase
-  Human time scale does not change with data scale
-  Automation requirements increase with data scale
•  Default configurations are no longer sufficient
-  Networks, systems, storage and tools must be built and deployed properly
(you wouldn’t deploy a supercomputer without tuning it)
-  Architecture must account for governing dynamics of data transfer applications
Required Components
•  Long distance data services require several things
-  High performance networks
Sufficient capacity
No packet loss
-  Well-configured systems at the ends
-  Tools that scale
•  Components must be assembled correctly
-  Ad-hoc deployments seldom work out well
-  Details matter
•  If we understand the constraints and work within
them, we are more likely to be consistently
successful
Boundaries And Constraints
•  Technology changes quickly – people and protocols change slowly
-  Increases in data, no change in people
-  Better tools, but using the same underlying protocols
•  Tools must be easy for people to use
-  A scientist’s time is very valuable
-  Computers are good at certain details
Management of data transfers
Failure recovery
Manipulation of thousands of files
-  Tools must easily allow scientists to scale themselves up
•  Protocols are hard to change
-  The Internet protocol suite is what we have to work with
-  Make sure data transfer architecture accounts for protocol behavior
TCP Background
•  Networks provide connectivity between hosts – how
do hosts see the network?
-  From an application’s perspective, the interface to “the other
end” is a socket
-  Communication is between applications – mostly over TCP
•  TCP – the fragile workhorse
-  TCP is (for very good reasons) timid – packet loss is
interpreted as congestion
-  Packet loss in conjunction with latency is a performance
killer
-  Like it or not, TCP is used for the vast majority of data
transfer applications
11
TCP Background (2)
•  It is far easier to architect the network to support TCP
than it is to fix TCP
-  People have been trying to fix TCP for years – limited success
-  Here we are – packet loss is still the number one performance
killer in long distance high performance environments
•  Pragmatically speaking, we must accommodate TCP
-  Implications for equipment selection
Provide loss-free service to TCP
Equipment must be able to accurately account for packets
-  Implications for network architecture, deployment models
Infrastructure must be designed to allow easy troubleshooting
Test and measurement tools are critical
12
Soft Failures
•  An automated network test alerted ESnet to poor performance on some of our high-latency paths
•  Routers did not detect any issues
-  perfSONAR (specialized performance measurement software) detected the problem
•  perfSONAR measured 0.0046% packet loss
-  1 packet in 22000 packets
•  Performance impact of this: (outbound/inbound)
-  To/from test host 1 ms RTT : 7.3 Gbps out / 9.8 Gbps in
-  To/from test host 11 ms RTT: 1.2 Gbps out / 9.5 Gbps in
-  To/from test host 51ms RTT: 800 Mbps out / 9 Gbps in
-  To/from test host 88 ms RTT: 500 Mbps out / 8 Gbps in
CONCLUSION: Data transfer across the US was more than 16 times
slower than it should have been. Science is impacted, researchers get
frustrated.
13
A small amount of packet loss makes a
huge difference in TCP performance
14
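One widely cited rule of thumb explains why (the Mathis et al. model for a single loss-limited TCP flow): sustained throughput is roughly MSS / (RTT × √p), where p is the packet loss rate. Because RTT sits in the denominator, the same tiny loss rate that is negligible on a 1 ms path can cap a long-haul flow at a small fraction of link speed, which matches the pattern in the loss measurements shown earlier.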
How Do We Accommodate TCP?
•  High-performance wide area TCP flows must get loss-free service
-  Sufficient bandwidth to avoid congestion
-  Deep enough buffers in routers and switches to handle bursts
Especially true for long-distance flows due to packet behavior
No, this isn’t buffer bloat
•  Equally important – the infrastructure must be verifiable so that clean
service can be provided
-  Stuff breaks
Hardware, software, optics, bugs, …
How do we deal with it in a production environment?
-  Must be able to prove a network device or path is functioning correctly
Accurate counters must exist and be accessible
Need ability to run tests – perfSONAR
-  Small footprint is a huge win – small number of devices so that problem
isolation is tractable
The Data Transfer Trifecta: The “Science DMZ” Model
Dedicated Systems for Data Transfer (Data Transfer Node):
•  High performance
•  Configured for data transfer
•  Proper tools, such as Globus Online
Network Architecture (Science DMZ):
•  Dedicated location for DTN
•  Proper security
•  Easy to deploy - no need to redesign the whole network
•  Additional info: http://fasterdata.es.net/
Performance Testing & Measurement (perfSONAR):
•  Enables fault isolation
•  Verify correct operation
•  Widely deployed in ESnet and other networks, as well as sites and facilities
Science DMZ Takes Many Forms
•  There are a lot of ways to combine these things – it
all depends on what you need to do
-  Small installation for a project or two
-  Facility inside a larger institution
-  Institutional capability serving multiple departments/divisions
-  Science capability that consumes a majority of the
infrastructure
•  Some of these are straightforward, others are less
obvious
•  Key point of concentration: High-latency path for TCP
17
The Data Transfer Trifecta: The “Science DMZ” Model
Dedicated Systems for Data Transfer (Data Transfer Node):
•  High performance
•  Configured for data transfer
•  Proper tools
Network Architecture (Science DMZ):
•  Dedicated location for DTN
•  Proper security
•  Easy to deploy - no need to redesign the whole network
•  Additional info: http://fasterdata.es.net/
Performance Testing & Measurement (perfSONAR):
•  Enables fault isolation
•  Verify correct operation
•  Widely deployed in ESnet and other networks, as well as sites and facilities
Small-scale or Prototype Science DMZ
•  Add-on to existing network infrastructure
-  All that is required is a port on the border router
-  Small footprint, pre-production commitment
•  Easy to experiment with components and
technologies
-  DTN prototyping
-  perfSONAR testing
•  Limited scope makes security policy exceptions easy
-  Only allow traffic from partners
-  Add-on to production infrastructure – lower risk
19
Science DMZ – Basic Abstract Cartoon
[Diagram: WAN → border router → Science DMZ switch/router → high-performance Data Transfer Node with high-speed storage; perfSONAR at the border, in the Science DMZ, and near the DTN; the site/campus LAN sits behind the enterprise border router/firewall, with per-service security policy control points and site/campus access to Science DMZ resources; 10GE links throughout]
Abstract Science DMZ Data Path
[Diagram: the same topology with the data path highlighted: a clean, high-bandwidth, high-latency WAN path from the border to the Science DMZ, and a low-latency LAN path from the Science DMZ switch/router to the Data Transfer Node with high-speed storage]
Supercomputer Center Deployment
•  High-performance networking is assumed in this environment
-  Data flows between systems, between systems and storage, wide
area, etc.
-  Global filesystem often ties resources together
Portions of this may not run over Ethernet (e.g. IB)
Implications for Data Transfer Nodes
•  “Science DMZ” may not look like a discrete entity here
-  By the time you get through interconnecting all the resources, you end
up with most of the network in the Science DMZ
-  This is as it should be – the point is appropriate deployment of tools,
configuration, policy control, etc.
•  Office networks can look like an afterthought, but they aren’t
-  Deployed with appropriate security controls
-  Office infrastructure need not be sized for science traffic
22
Supercomputer Center
[Diagram: the WAN and a virtual circuit land on the border router; a firewall fronts the routed office networks, while the core switch/router connects front-end switches, Data Transfer Nodes, the supercomputer, and the parallel filesystem; perfSONAR hosts at the border, the core, and the DTNs]
Supercomputer Center
[Diagram: the same topology with the paths highlighted: the high-latency WAN path and high-latency virtual-circuit path from the border, and the low-latency LAN path among the Data Transfer Nodes, supercomputer, and parallel filesystem]
Major Data Site Deployment
•  In some cases, large scale data service is the
major driver
-  Huge volumes of data – ingest, export
-  Individual DTNs don’t exist here – data transfer clusters
•  Single-pipe deployments don’t work
-  Everything is parallel
Networks (Nx10G LAGs, soon to be Nx100G)
Hosts – data transfer clusters, no individual DTNs
WAN connections – multiple entry, redundant equipment
-  Choke points (e.g. firewalls) cause problems
25
Data Site – Architecture
[Diagram: multiple WAN connections and virtual circuits terminate on provider edge and border routers; HA firewalls front the site/campus LAN, while the data transfer cluster sits on a dedicated data service switch plane; perfSONAR hosts at the border and on the data service plane]
Data Site – Data Path
[Diagram: the same topology with the data path highlighted, running from the WAN virtual circuits through the border routers and the data service switch plane to the data transfer cluster, not through the HA firewalls in front of the site/campus LAN]
Common Threads
•  Two common threads exist in all these examples
•  Accommodation of TCP
-  Wide area portion of data transfers traverses purpose-built path
-  High performance devices that don’t drop packets
•  Ability to test and verify
-  When problems arise (and they always will), they can be solved if
the infrastructure is built correctly
-  Small device count makes it easier to find issues
-  Multiple test and measurement hosts provide multiple views of the
data path
perfSONAR nodes at the site and in the WAN
perfSONAR nodes at the remote site
The Data Transfer Trifecta: The “Science DMZ” Model
Dedicated Systems for Data Transfer (Data Transfer Node):
•  High performance
•  Configured for data transfer
•  Proper tools
Network Architecture (Science DMZ):
•  Dedicated location for DTN
•  Proper security
•  Easy to deploy - no need to redesign the whole network
•  Additional info: http://fasterdata.es.net/
Performance Testing & Measurement (perfSONAR):
•  Enables fault isolation
•  Verify correct operation
•  Widely deployed in ESnet and other networks, as well as sites and facilities
Test and Measurement – Keeping the
Network Clean
•  The wide area network, the Science DMZ, and all its
systems start out functioning perfectly
•  Eventually something is going to break
-  Networks and systems are built with many, many
components
-  Sometimes things just break – this is why we buy
support contracts
•  Other problems arise as well – bugs, mistakes, whatever
•  We must be able to find and fix problems when they occur
•  Why is this so important? Because we use TCP!
perfSONAR
•  All the network diagrams have little perfSONAR boxes
everywhere
-  The reason for this is that consistent behavior requires correctness
-  Correctness requires the ability to find and fix problems
You can’t fix what you can’t find
You can’t find what you can’t see
perfSONAR lets you see
•  Especially important when deploying high performance
services
-  If there is a problem with the infrastructure, need to fix it
-  If the problem is not with your stuff, need to prove it
Many players in an end to end path
Ability to show correct behavior aids in problem localization
31
What is perfSONAR?
•  perfSONAR is a tool to:
-  Set network performance expectations
-  Find network problems (“soft failures”)
-  Help fix these problems
• All in multi-domain environments
-  Widely deployed international capability
-  Easy for HPC centers to attach to
Leverage international test and measurement
Reduce troubleshooting overhead (no need for just-in-time tools deployment)
32
perfSONAR
-  perfSONAR is a critical toolset for maintaining high-performance data transfer infrastructure
-  http://www.perfsonar.net/
-  http://fasterdata.es.net/performance-testing/perfsonar/
-  More information this afternoon
-  There is also a perfSONAR tutorial tomorrow
Ensuring Network Performance with perfSONAR
http://sc13.supercomputing.org/content/tutorials
The Data Transfer Trifecta: The “Science DMZ” Model
Dedicated Systems for Data Transfer (Data Transfer Node):
•  High performance
•  Configured for data transfer
•  Proper tools
Network Architecture (Science DMZ):
•  Dedicated location for DTN
•  Proper security
•  Easy to deploy - no need to redesign the whole network
•  Additional info: http://fasterdata.es.net/
Performance Testing & Measurement (perfSONAR):
•  Enables fault isolation
•  Verify correct operation
•  Widely deployed in ESnet and other networks, as well as sites and facilities
The Data Transfer Node
•  The Data Transfer Node is a high-performance host, typically a
Linux PC server, that is optimized for a data mobility workload
•  A DTN server is made of several subsystems. Each needs to
perform optimally for the DTN workflow:
•  Storage: capacity, performance, reliability, physical footprint
•  Networking: protocol support, optimization, reliability
•  Motherboard: I/O paths, PCIe subsystem, IPMI
•  The DTN is dedicated to data transfer, not data analysis or manipulation
•  In HPC centers, DTNs typically mount a global filesystem
-  Best deployed in sets (e.g. 4-8 DTNs)
-  Globus Online can use multiple DTNs
DTN Toolset
•  DTNs are built for wide area data movement
-  Entry point for external data
-  Well-configured for consistent performance
-  Need optimized tools
•  Remember from before – we need a powerful toolset with advanced features to handle data scale
-  Easy user interface
-  Management of data transfers
-  Failure recovery
-  Manipulation of thousands (or millions, or …) of files
•  Globus Online provides these features – ideal DTN
toolset
Globus Online: Software-as-a-Service for
Research Data Management
37
What is Globus Online?
Big data transfer
and sharing…
…with Dropbox-like
simplicity…
…directly from your own
storage systems
38
Reliable, secure, high-performance file transfer and synchronization
•  “Fire-and-forget” transfers
•  Automatic fault recovery
•  Seamless security integration
[Diagram: (1) user initiates transfer request; (2) Globus Online moves and syncs files from data source to data destination; (3) Globus Online notifies user]
39
Simple, secure sharing off existing storage systems
•  Easily share large data with any user or group
•  No cloud storage required
[Diagram: (1) User A selects file(s) to share on the data source, selects a user or group, and sets permissions; (2) Globus Online tracks shared files; no need to move files to cloud storage! (3) User B logs in to Globus Online and accesses the shared file]
40
Globus Online is SaaS
•  Web, command line, and REST interfaces
•  Reduced IT operational costs
•  New features automatically available
•  Consolidated support & troubleshooting
•  Easy to add your laptop, server, cluster,
supercomputer, etc. with Globus Connect
41
Globus Online For HPC Centers
•  Globus Online is easy for users to use
•  How might an HPC center make large-scale science data available via Globus Online?
•  The Science DMZ model provides a set of design
patterns
-  High-performance foundation
-  Enable next-generation data services for science
-  Ideal platform for Globus Online deployment
42
Demonstration
File transfer, sharing and
group management
43
Exercise 1
Globus Online signup and
configuration
44
Exercise 2
Globus Online file transfer
and sharing
45
David Skinner, Shreyas Cholia, Jason Hick, Matt Andrews
NERSC, LBNL
NERSC Use Case
Networked Data @ NERSC 2013
[Diagram: GPFS servers export the 4 PB JGI file systems, 5 PB /project, 250 TB /global/homes, and 4 PB /global/scratch over two IB subnets to the DTNs, Carver, pNSDs, Genepool, PDSF, and (via DVS) Hopper and Edison]
47
WAN Data Transfer @ NERSC
•  NERSC hosts several PB of data in HPSS and Global FS
•  Working towards Globus Online Provider Plan
•  Data transfer working group
-  Peer with ESnet and other national labs to solve end-to-end data challenges
•  Advanced Networking Initiative with ANL/ESnet (100Gb Ethernet)
•  Data transfer nodes optimized for high performance WAN transfers
•  User-centric – User Services and data experts work with users to solve transfer challenges
•  Support for multiple interfaces
-  Globus Online
-  GridFTP
-  BBCP
Very popular w/ users
-  Optimized SCP
•  Distribution point for scientific data – public data repositories and science
gateways
-  QCD, ESG, DeepSky, PTF, MaterialsProject, Daya Bay, C20C, SNO etc.
GO @ NERSC 2012-13 Trends
•  120 sources, 175 destinations
[Charts: total bytes transferred per unique endpoint, for bytes sent to NERSC (by source) and bytes sent from NERSC (by destination)]
•  To NERSC: 124 users, 4.5K transfers, 46M files, 8K dirs, 0.5 PB, 3K deletes
•  From NERSC: 134 users, 3K transfers, 22M files, 7K dirs, 0.4 PB
We also have 100 Globus Connect users (not included above)
Optimizing Inter-Site Data Movement
•  NERSC is a net importer of data
-  Since 2007, we have received more bytes than we have transferred out to other sites
-  Leadership in inter-site data transfers (Data Transfer Working Group)
-  Have been an archive site for several valuable scientific projects: SNO project (70TB), 20th Century Reanalysis project (140TB)
•  NERSC needs production support for data movement. New features.
•  Partnering with ANL and ESnet to advance HPC network capabilities.
-  100Gb, Magellan ANL and NERSC cloud research infrastructure.
-  Community of GO Providers?
•  Data Transfer Working Group
-  Coordination between stakeholders (ESnet, ORNL, ANL, LANL and LBNL/NERSC)
•  Data Transfer Nodes
-  Globus Online endpoints
-  Available to all NERSC users
-  Optimized for WAN transfers and regularly tested
-  Close to the NERSC border
-  hsi for HPSS-to-HPSS transfers (OLCF and NERSC)
-  gridFTP for system-to-system or system-to-HPSS transfers
-  bbcp too
How is this important to
Science?
Use cases in data-centric science
from projects hosted at NERSC
Science Discovery & Data Analysis
Astrophysics discover early nearby supernova
•  Palomar Transient Factory runs machine learning algorithms on ~300GB/night delivered by ESnet & the Science DMZ
•  Rare glimpse of a supernova within 11 hours of
explosion, 20M light years away
•  Telescopes world-wide redirected within 1 hour
Data systems essential to science success
•  GPFS /project file system mounted on
resources centerwide, brings broad range of
resources to the data
•  Data Transfer Nodes and Science Gateway
Nodes improve data acquisition, access and
processing capabilities
[Images: the supernova field on 23 August, 24 August, and 25 August]
One of Science Magazine’s Top 10 Breakthroughs of 2012
Discovery of θ13 weak mixing angle
•  The last and most elusive piece of a
longstanding puzzle: How can
neutrinos appear to vanish as they
travel?
•  The answer – a new, large type of
neutrino oscillation
-  Affords new understanding of
fundamental physics
-  May help solve the riddle of
matter-antimatter asymmetry in
the universe.
[Caption: Detectors count antineutrinos near the Daya Bay nuclear reactor in China. By calculating how many would be seen if there were no oscillation and comparing to measurements, a 6.0% rate deficit provides clear evidence of the new transformation.]
Experiment Could Not Have Been Done Without NERSC and ESnet
•  PDSF for simulation and analysis
•  HPSS for archiving and ingesting data
•  ESnet for data transfer
•  NERSC Global File System & Science Gateways for distributing results
•  NERSC is the only US site where all raw, simulated, and derived data are analyzed and archived
PI: Kam-Biu Luk (LBNL)
BES Light Sources
•  Current and emerging needs in prompt and off-line computational analysis of light source data show bottlenecks.
•  150TB / experiment
•  180K images
•  Bursty, realtime needs
•  Automated hands-off data migration
•  HTC analysis
•  Detector roadmaps go only up
•  Science moving toward larger systems (surfaces, clusters, membranes)
•  Need data interfaces!
[Figure: (Observed – Predicted) spot positions, entire run, one tile; axes in pixels]
Dectris Ltd pixel array detectors:
System Name      Frame Rate (Hz)   Image size, compressed (MB)   Data rate (MB/s)
Pilatus 6M       12                6                              72
Pilatus 6M fast  25                6                              150
Pilatus3 6M      100               6                              600
Eiger 16M        180               16                             2880
Framework
[Diagram: beamline and remote users interact with a prompt analysis pipeline and a science gateway; experiment images, reference images, and experiment metadata are compared and flow through a Data Transfer Node along the data pipeline to HPC compute resources and HPC storage and archive]
Growing science drivers: single particle work, reconstruction, heterogeneous samples
[Images: giant mimivirus XFEL single-particle diffraction; AFM (atomic force microscopy); EM (cryo-electron microscope reconstruction); autocorrelation function, 30000 images combined; phased imaging from X-ray diffraction; carboxysome. Credits: Seibert (2011) Nature; Rossmann (2011) Intervirology]
Meeting Challenges w/Globus Online
•  Well-worn transfer paths usually work well, but there are significant challenges connecting to new data customers
-  Significant admin effort on all sides
-  Pinpointing bottlenecks can be tricky, especially when you don’t have control over the network
-  Socialize other sites to use the DTN and DMZ model
-  Data transfer working group has been a HUGE help, as has Globus support with users
•  Need constant monitoring. Small changes or upgrades to network can kill performance.
•  Last mile problem – how do you deal with bottlenecks on the end user workstation?
•  Usability issues – most of the knobs on transfer interfaces seem like black magic to the end
user.
IDEAS:
•  Globus Online has helped alleviate pain on the user side. Can we have a view for network
sysadmins (detailed error reporting, diagnostics … )?
•  Central service to diagnose network performance issues?
•  Continuous centrally managed performance monitoring?
NERSC & Globus Online Futures
NERSC WAN Requirements: last 10 years
Daily WAN traffic in/out of NERSC over the last decade points to Tb
networking in 2016
NERSC Daily WAN Traffic since 2001
100000
GB/day
10000
1000
100
2000
2002
2004
2006
2008
2010
Year
2012
2014
2016
2018
Globus Online in 2016?
•  Yesterday: Explicit transfers using knowledge of tools, techniques and crafts.
-  Talk to the right person
-  Use the right system (tuned)
-  Use the right client software (GO, gridFTP, bbcp, …)
-  Have the right storage hardware
-  Manage the data (copies)
•  Today:
-  Hosted third-party, hands-off transfers
-  Soon: HPSS integration, session limits, dashboards, EZ auth
•  Exascale: Cross-mounted storage systems at key sites and unattended data
set transfer capability.
-  Automation, everywhere. Maybe local caches for performance?
-  Latency guaranteed and reliable
-  QOS for bandwidth guarantee
-  Site specific authentication and authorization issues solved
2016 BES Science Depends on ESnet
•  BES lightsources (synchrotrons and FELs) are experiencing an
exponential increase of data.
-  2009: 65 TB/yr; 2011: 312 TB/yr; 2013: 1.9 PB/yr
-  To the degree that beamline science is bracketed by data
analysis bottlenecks, these instruments will not deliver their
full scientific value
•  Beamline scientists and users are realizing that end-station data storage
and analysis is not scalable. BES communities are ready for science
gateway approach.
•  Computing facilities form the ideal crucible for developing and testing the
end-to-end system needed for current and next-generation photon
science.
•  LBNL LDRDs in 2013, ASCR Data Postdocs
Demonstration: Globus Online
advanced file transfer features
- Data synchronization
- Endpoint management
- Command line interface
62
Morning Session Recap
63
The Science DMZ
Configuration and Troubleshooting
Outline
•  Science DMZ – architectural components
•  Using perfSONAR
•  Problems With Firewalls
•  Data Transfer Nodes
Science DMZ Background
•  The data mobility performance requirements for data
intensive science are beyond what can typically be
achieved using traditional methods
-  Default host configurations (TCP, filesystems, NICs)
-  Converged network architectures designed for
commodity traffic
-  Conventional security tools and policies
-  Legacy data transfer tools (e.g. SCP)
-  Wait-for-trouble-ticket operational models for network
performance
Science DMZ Background (2)
•  The Science DMZ model describes a performance-based
approach
-  Dedicated infrastructure for wide-area data transfer
Well-configured data transfer hosts with modern tools
Capable network devices
High-performance data path which does not traverse commodity
LAN
-  Proactive operational models that enable performance
Well-deployed test and measurement tools (perfSONAR)
Periodic testing to locate issues instead of waiting for users to
complain
-  Security posture well-matched to high-performance
science applications
A small amount of packet loss makes a
huge difference in TCP performance
The Data Transfer Trifecta: The “Science DMZ” Model
Dedicated Systems for Data Transfer (Data Transfer Node):
•  High performance
•  Configured for data transfer
•  Proper tools, such as Globus Online
Network Architecture (Science DMZ):
•  Dedicated location for DTN
•  Proper security
•  Easy to deploy - no need to redesign the whole network
•  Additional info: http://fasterdata.es.net/
Performance Testing & Measurement (perfSONAR):
•  Enables fault isolation
•  Verify correct operation
•  Widely deployed in ESnet and other networks, as well as sites and facilities
Abstract Science DMZ
[Diagram: as before: WAN → border router → Science DMZ switch/router → Data Transfer Node with high-speed storage; perfSONAR at the border, in the Science DMZ, and near the DTN; site/campus LAN behind the enterprise border router/firewall with per-service security policy control points; the high-latency WAN path and low-latency LAN path are highlighted]
70
Supercomputer Center
[Diagram: as before: border router, firewall and routed offices, virtual circuit, core switch/router, front-end switches, Data Transfer Nodes, supercomputer, and parallel filesystem, with perfSONAR hosts throughout; the high-latency WAN and virtual-circuit paths and the low-latency LAN path are highlighted]
71
Data Site
[Diagram: as before: provider edge and border routers with redundant WAN virtual circuits, HA firewalls in front of the site/campus LAN, and a data transfer cluster on a dedicated data service switch plane, with perfSONAR hosts throughout]
72
Common Threads
•  Two common threads exist in all these examples
•  Accommodation of TCP
-  Wide area portion of data transfers traverses purpose-built path
-  High performance devices that don’t drop packets
•  Ability to test and verify
-  When problems arise (and they always will), they can be solved if
the infrastructure is built correctly
-  Small device count makes it easier to find issues
-  Multiple test and measurement hosts provide multiple views of the
data path
perfSONAR nodes at the site and in the WAN
perfSONAR nodes at the remote site
The Data Transfer Trifecta: The “Science DMZ” Model
Dedicated Systems for Data Transfer (Data Transfer Node):
•  High performance
•  Configured for data transfer
•  Proper tools
Network Architecture (Science DMZ):
•  Dedicated location for DTN
•  Proper security
•  Easy to deploy - no need to redesign the whole network
•  Additional info: http://fasterdata.es.net/
Performance Testing & Measurement (perfSONAR):
•  Enables fault isolation
•  Verify correct operation
•  Widely deployed in ESnet and other networks, as well as sites and facilities
74
perfSONAR
•  All the network diagrams have little perfSONAR boxes
everywhere
-  The reason for this is that consistent behavior requires correctness
-  Correctness requires the ability to find and fix problems
You can’t fix what you can’t find
You can’t find what you can’t see
perfSONAR lets you see
•  Especially important when deploying high performance
services
-  If there is a problem with the infrastructure, need to fix it
-  If the problem is not with your stuff, need to prove it
Many players in an end to end path
Ability to show correct behavior aids in problem localization
75
Regular perfSONAR Tests
•  We run regular tests to check for two things
-  TCP throughput – default is every 4 hours, ESnet runs every 6
-  One way delay and packet loss
•  perfSONAR has mechanisms for managing regular
testing between perfSONAR hosts
-  Statistics collection and archiving
-  Graphs
-  Dashboard display
•  This infrastructure is deployed now – perfSONAR hosts at
facilities can take advantage of it
•  At-a-glance health check for data infrastructure
76
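As a concrete illustration, an on-demand check against an already-deployed perfSONAR host can look like the following; the hostname is a placeholder, and the exact tools and options available depend on the perfSONAR version installed at both ends:
$ bwctl -c ps.example.net -t 20
-  A ~20-second TCP throughput test to the remote perfSONAR host (placeholder name)
$ owping ps.example.net
-  One-way delay and packet-loss measurement via OWAMP to the same host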
perfSONAR Dashboard
Status at-a-glance
•  Packet loss
•  Throughput
•  Correctness
Current live instance at
http://ps-dashboard.es.net/
Drill-down capabilities
•  Test history between hosts
•  Ability to correlate with
other events
•  Very valuable for fault
localization
77
Throughput – Setting Expectations
Throughput measurements
•  Periodic verification
•  More concrete than packet loss
•  Easier to show to users
Current live instance at
http://ps-dashboard.es.net/
Setting expectations is key
•  If the tester can’t get good
performance, forget big data
•  If the tester works and the DTNs
do not, you’ve learned something
78
Throughput Detail Graph
•  Temporary drop in performance was due to re-route around a fiber cut
-  Latency increase
-  Clean otherwise (performance stayed high)
•  Other than that, it’s stable, and performs well (over 2Gbps per stream)
•  This is a powerful tool for expectation management
79
Quick perfSONAR Demo
•  Look up and test to a remote perfSONAR server
80
perfSONAR Deployment Locations
•  Critical to deploy such that you can test with useful semantics
•  perfSONAR hosts allow parts of the path to be tested
separately
-  Reduced visibility for devices between perfSONAR hosts
-  Must rely on counters or other means where perfSONAR can’t go (see the host-side counter example after this slide)
•  Effective test methodology derived from protocol behavior
-  TCP suffers much more from packet loss as latency increases
-  TCP is more likely to cause loss as latency increases
-  Testing should leverage this in two ways
Design tests so that they are likely to fail if there is a problem
Mimic the behavior of production traffic as much as possible
-  Note: don’t design your tests to succeed
The point is not to “be green” even if there are problems
The point is to find problems when they come up so that the problems are fixed
quickly
81
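Where perfSONAR can’t go, host-side counters on the DTN or test host are the next best view; for example (the interface name is a placeholder, and the statistics reported by ethtool vary by NIC driver):
$ ip -s link show eth0
-  Per-interface packet, error, and drop counters
$ ethtool -S eth0 | egrep -i 'drop|err|fifo'
-  NIC/driver-level statistics; anything incrementing during a transfer deserves a closer look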
WAN Test Methodology – Problem
Isolation
•  Segment-to-segment testing is unlikely to be helpful
-  TCP dynamics will be different
-  Problem links can test clean over short distances
-  This also goes for testing from a Science DMZ to the border router or
first provider perfSONAR host
•  Run long-distance tests
-  Run the longest clean test you can, then look for the shortest dirty test
that includes the path of the clean test
-  In many cases, there is a problem between the two remote test
locations
•  In order for this to work, the testers need to be already
deployed when you start troubleshooting
-  ESnet has at least one perfSONAR host at each hub location. So
does Internet2. So do many regional networks.
-  If your provider does not have perfSONAR deployed, ask them why –
then ask when they will have it done
82
Wide Area Testing – Problem Statement
[Diagram: a national laboratory and a university campus, each with border and Science DMZ perfSONAR hosts on 10GE, connected across the WAN; transfers between them show poor performance]
83
Wide Area Testing – Full Context
[Diagram: the full end-to-end path: lab network (~1 msec) → ESnet path (~30 msec) → Internet2 path (~15 msec) → regional path (~2 msec) → campus network (~1 msec), with perfSONAR hosts and 10GE/100GE links at each hop; performance is poor end to end]
84
Wide Area Testing – Long Clean Test
[Diagram: the same path; a long (~48 msec) test spanning the lab, ESnet, and Internet2 portions of the path runs clean and fast in both directions]
85
Poorly Performing Tests Illustrate Likely Problem Areas
[Diagram: the same path; the ~48 msec tests that stop short of the far end are clean and fast, while the ~49 msec tests that extend one hop further are dirty and slow, localizing the problem to that final portion of the path]
86
Lessons From The Test Case
•  This testing can be done quickly if perfSONAR is already deployed
•  Huge productivity gain
-  Reasonable hypothesis developed quickly
-  Probable administrative domain identified
-  Testing time can be short – an hour or so at most
•  Without perfSONAR cases like this are very challenging
•  Time to resolution measured in months (I’ve worked those too)
•  In order to be useful for data-intensive science, the network must be
fixable quickly, because stuff breaks
•  The Science DMZ model allows high-performance use of the
network, but perfSONAR is necessary to ensure the whole kit
functions well
87
Science DMZ Security Model
•  Goal – disentangle security policy and enforcement for
science flows from security for business systems
•  Rationale
-  Science flows are relatively simple from a security
perspective
-  Narrow application set on Science DMZ
Data transfer, data streaming packages
No printers, document readers, web browsers, building control
systems, staff desktops, etc.
-  Security controls that are typically implemented to
protect business resources often cause performance
problems
88
Performance Is A Core Requirement
•  Core information security principles
-  Confidentiality, Integrity, Availability (CIA)
-  These apply to systems as well as to information, and have far-reaching effects
Credentials for privileged access must typically be kept confidential
Systems that are faulty or unreliable are not useful scientific tools
Data access is sometimes restricted, e.g. embargo before publication
Some data (e.g. medical data) has stringent requirements
•  In data-intensive science, performance is an additional core mission
requirement
-  CIA principles are important, but if the performance isn’t there
the science mission fails
-  This isn’t about “how much” security you have, but how the
security is implemented
-  We need to be able to appropriately secure systems in a way that
does not compromise performance or hinder the adoption of
advanced services
89
Placement Outside the Firewall
•  The Science DMZ resources are placed outside the enterprise
firewall for performance reasons
-  The meaning of this is specific – Science DMZ traffic does not
traverse the firewall data plane
-  This has nothing to do with whether packet filtering is part of the
security enforcement toolkit
•  Lots of heartburn over this, especially from the perspective of a
conventional firewall manager
-  Lots of organizational policy directives mandating firewalls
-  Firewalls are designed to protect converged enterprise
networks
-  Why would you put critical assets outside the firewall???
•  The answer is that firewalls are typically a poor fit for high-performance science applications
90
Let’s Talk About Firewalls
•  A firewall’s job is to enhance security by blocking activity that
might compromise security
-  This means that a firewall’s job is to prevent things from happening
-  Traditional firewall policy doctrine dictates a default-deny policy
Find out what business you need to do
Block everything else
•  Firewalls are typically designed for commodity or enterprise
environments
-  This makes sense from the firewall designer’s perspective – lots of IT
spending in commodity environments
-  Firewall design choices are well-matched to commodity traffic profile
High device count, high user count, high concurrent flow count
Low per-flow bandwidth
Highly capable inspection and analysis of business applications
91
Thought Experiment
•  We’re going to do a thought experiment
•  Consider a network between three buildings – A, B, and C
•  This is supposedly a 10Gbps network end to end (look at the
links on the buildings)
•  Building A houses the border router – not much goes on there
except the external connectivity
•  Lots of work happens in building B – so much so that the
processing is done with multiple processors to spread the load
in an affordable way, and results are aggregated after
•  Building C is where we branch out to other buildings
•  Every link between buildings is 10Gbps – this is a 10Gbps
network, right???
92
Notional 10G Network Between Buildings
[Diagram: Building A (border, WAN-facing, with perfSONAR) connects to Building B over 10GE; inside Building B the work is spread across many 1G processors whose results are aggregated; Building B connects over 10GE to Building C, which fans out 10GE links to other buildings]
93
Clearly Not A 10Gbps Network
•  If you look at the inside of Building B, it is obvious from a
network engineering perspective that this is not a 10Gbps
network
-  Clearly the maximum per-flow data rate is 1Gbps, not 10Gbps
-  However, if you convert the buildings into network elements while
keeping their internals intact, you get routers and firewalls
-  What firewall did the organization buy? What’s inside it?
-  Those little 1G “switches” are firewall processors
•  This parallel firewall architecture has been in use for
years
-  Slower processors are cheaper
-  Typically fine for a commodity traffic load
-  Therefore, this design is cost competitive and common
94
Notional 10G Network Between Devices
[Diagram: the same picture with the buildings relabeled as devices: Building A is the border router, Building B is the firewall (with its internal 1G firewall processors), and Building C is the internal router fanning out 10GE links]
95
Notional Network Logical Diagram
[Diagram: WAN → border router → border firewall → internal router, all apparently 10GE, with perfSONAR at the border]
96
Firewall Example
•  Totally protected campus, with a border firewall
Performance Behind the Firewall
[Graph: transfer throughput measured through the border firewall]
•  Blue = “Outbound”, e.g. campus to remote location upload
•  Green = “Inbound”, e.g. download from remote location
What’s Inside Your Firewall?
•  Vendor: “But wait – we don’t do this anymore!”
-  It is true that vendors are working toward line-rate 10G firewalls, and
some may even have them now
-  10GE has been deployed in science environments for over 10 years
-  Firewall internals have only recently started to catch up with the 10G
world
-  100GE is being deployed now, 40Gbps host interfaces are available
now
-  Firewalls are behind again
•  In general, IT shops want to get 5+ years out of a firewall purchase
-  This often means that the firewall is years behind the technology curve
-  Whatever you deploy now, that’s the hardware feature set you get
-  When a new science project tries to deploy data-intensive resources, they
get whatever feature set was purchased several years ago
99
Firewall Capabilities and Science
Traffic
•  Firewalls have a lot of sophistication in an enterprise
setting
-  Application layer protocol analysis (HTTP, POP, MSRPC, etc.)
-  Built-in VPN servers
-  User awareness
•  Data-intensive science flows don’t match this profile
-  Common case – data on filesystem A needs to be on filesystem Z
Data transfer tool verifies credentials over an encrypted channel
Then open a socket or set of sockets, and send data until done (1TB, 10TB,
100TB, …)
-  One workflow can use 10% to 50% or more of a 10G network link
•  Do we have to use a firewall?
100
Firewalls As Access Lists
•  When you ask a firewall administrator to allow data transfers
through the firewall, what do they ask for?
-  IP address of your host
-  IP address of the remote host
-  Port range
-  That looks like an ACL to me!
•  No special config for advanced protocol analysis – just
address/port
•  Router ACLs are better than firewalls at address/port filtering
-  ACL capabilities are typically built into the router
-  Router ACLs typically do not drop traffic permitted by policy
101
Security Without Firewalls
•  Data intensive science traffic interacts poorly with
firewalls
•  Does this mean we ignore security? NO!
-  We must protect our systems
-  We just need to find a way to do security that does not
prevent us from getting the science done
•  Key point – security policies and mechanisms that
protect the Science DMZ should be implemented so
that they do not compromise performance
102
If Not Firewalls, Then What?
•  Remember – the goal is to protect systems in a way that
allows the science mission to succeed
•  I like something I heard at NERSC – paraphrasing:
“Security controls should enhance the utility of science
infrastructure.”
•  There are multiple ways to solve this – some are
technical, and some are organizational/sociological
•  I’m not going to lie to you – this is harder than just putting
up a firewall and closing your eyes
103
Other Technical Capabilities
•  Intrusion Detection Systems (IDS)
-  One example is Bro – http://bro-ids.org/
-  Bro is high-performance and battle-tested
Bro protects several high-performance national assets
Bro can be scaled with clustering: http://www.bro-ids.org/documentation/cluster.html
-  Host-based IDS can run on the DTN
•  Netflow and IPFIX can provide intelligence, but not filtering
•  Openflow and SDN
-  Using Openflow to control access to a network-based service
seems pretty obvious
-  There is clearly a hole in the ecosystem with the label “Openflow
Firewall” – I really hope someone is working on this
-  Could significantly reduce the attack surface for any authenticated
network service
-  This would only work if the Openflow device had a robust data
plane
104
Other Technical Capabilities (2)
•  Aggressive access lists
-  More useful with project-specific DTNs
-  If the purpose of the DTN is to exchange data with a small set of remote collaborators, the ACL is pretty easy to write (see the sketch after this slide)
-  Large-scale data distribution servers are hard to handle this way
(but then, the firewall ruleset for such a service would be pretty
open too)
•  Limitation of the application set
-  One of the reasons to limit the application set in the Science DMZ
is to make it easier to protect
-  Keep desktop applications off the DTN (and watch for them
anyway using logging, netflow, etc – take violations seriously)
-  This requires collaboration between people – networking, security,
systems, and scientists
105
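To make the “pretty easy to write” claim concrete, a host-based sketch of such an aggressive access list on a project-specific DTN might look like the following; the subnets are placeholders from the documentation address ranges, the ports assume GridFTP defaults, and a real deployment would typically express the same policy in router ACLs as well:
$ iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
$ iptables -A INPUT -i lo -j ACCEPT
$ iptables -A INPUT -s 192.0.2.0/24 -p tcp --dport 2811 -j ACCEPT
-  GridFTP control channel, only from the collaborator’s subnet (placeholder)
$ iptables -A INPUT -s 192.0.2.0/24 -p tcp --dport 50000:51000 -j ACCEPT
-  GridFTP data channels, same subnet
$ iptables -A INPUT -s 198.51.100.0/24 -p tcp --dport 22 -j ACCEPT
-  SSH for administration, from a management network (placeholder)
$ iptables -P INPUT DROP
-  Default-deny everything else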
The Data Transfer Trifecta: The “Science DMZ” Model
Dedicated Systems for Data Transfer (Data Transfer Node):
•  High performance
•  Configured for data transfer
•  Proper tools
Network Architecture (Science DMZ):
•  Dedicated location for DTN
•  Proper security
•  Easy to deploy - no need to redesign the whole network
•  Additional info: http://fasterdata.es.net/
Performance Testing & Measurement (perfSONAR):
•  Enables fault isolation
•  Verify correct operation
•  Widely deployed in ESnet and other networks, as well as sites and facilities
106
The Data Transfer Node
•  The DTN is where it all comes together
-  Data comes in from the network
-  Data received by application (e.g. Globus Online)
-  Data written to storage
•  Dedicated systems are important
-  Tuning for DTNs is specific
-  Best to avoid engineering compromises for competing applications
•  Hardware selection and configuration are key
-  Motherboard architecture
-  Tuning: BIOS, firmware, drivers, networking, filesystem, application
•  Hats off to Eric Pouyoul
107
DTN Hardware Selection
•  The typical engineering trade-offs between cost,
redundancy, performance, and so forth apply when
deciding on what hardware to use for a DTN
-  e.g. redundant power supplies, SATA or SAS disks, cost/
performance for RAID controllers, quality of NICs, etc.
•  We recommend getting a system that can be
expanded to meet future storage/bandwidth
requirements.
-  Science data is growing faster than Moore’s Law, so you might as well get a host that can support 40GE now, even if you only have a 10G network
•  Buying the right hardware is not enough. Extensive tuning is needed for a well-optimized DTN.
108
Low performance: SuperMicro X7DWT
109
High Performance Motherboard
110
Motherboard and Chassis selection
•  Full 40GE requires a PCI Express gen 3 (aka
PCIe gen3) motherboard
-  this means you are limited to an Intel Sandy Bridge or Ivy Bridge host
•  Other considerations are memory speed, number
of PCI slots, extra cooling, and an adequate
power supply.
•  Intel’s Sandy/Ivy Bridge (Ivy Bridge is about 20%
faster) CPU architecture provides these features,
which benefit a DTN:
-  PCIe Gen3 support (up to 16 GB/sec)
-  Turbo Boost (up to 3.9 GHz for the i7)
111
PCI Bus Considerations
•  Make sure the motherboard you select has the right number of slots with the right number of lanes for your planned usage. For example:
-  10GE NICs require an 8-lane PCIe-2 slot
-  40G/QDR NICs require an 8-lane PCIe-3 slot
-  HotLava has a 6x10GE NIC that requires a 16-lane PCIe-2 slot
-  Most RAID controllers require an 8-lane PCIe-2 slot
-  Very high-end RAID controllers for SSD might require a 16-lane PCIe-2 slot or an 8-lane PCIe-3 slot
-  A high-end Fusion-io ioDrive requires a 16-lane PCIe-2 slot
•  Some motherboards to look at include the following:
•  SuperMicro X9DR3-F
•  Sample Dell Server (Poweredge r320-r720)
•  Sample HP Server (ProLiant DL380p gen8 High Performance model)
112
Network Subsystem Selection
•  There is a huge performance difference between cheap and
expensive 10G NICs.
-  You should not go cheap with the NIC, as a high quality NIC is
important for an optimized DTN host.
•  NIC features to look for include (see the ethtool sketch after this slide):
-  support for interrupt coalescing
-  support for MSI-X
-  TCP Offload Engine (TOE)
-  support for zero-copy protocols such as RDMA (RoCE or iWARP)
•  Note that many 10G and 40G NICs come in dual-port versions, but that does not mean you get double the performance by using both ports at the same time. Often the second port is meant to be used as a backup.
-  True 2x10G capable cards include the Myricom 10G-PCIE2-8C2-2S
and the Mellanox MCX312A-XCBT.
113
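Interrupt coalescing and ring sizes, in particular, are usually adjusted with ethtool; the interface name and values below are purely illustrative, and the parameters actually supported differ by NIC and driver:
$ ethtool -c eth2
-  Show the current interrupt coalescing settings
$ ethtool -C eth2 rx-usecs 100
-  Coalesce receive interrupts (example value; tune and re-measure)
$ ethtool -g eth2
$ ethtool -G eth2 rx 4096 tx 4096
-  Show, then raise, the RX/TX ring sizes up to the hardware maximum reported by -g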
Tuning
•  Defaults are usually not appropriate for performance.
•  What needs to be tuned:
-  BIOS
-  Firmware
-  Device Drivers
-  Networking
-  File System
-  Application
114
Tuning Methodology
•  At any given time, one element in the system is preventing it from going faster.
•  Step 1: Unit Tuning: focusing on the DTN workflow, experiment and
adjust element by element until reaching the maximum bare metal
performance.
•  Step 2: Run application (all elements are used) and refine tuning until
reaching goal performance.
•  A more thorough treatment of tuning is at the end of this deck (this is
a 30-60 minute talk in itself!)
•  Tuning your DTN host is extremely important. We have seen overall
IO throughput of a DTN more than double with proper tuning.
•  Tuning can be as much art as science. Due to differences in hardware, it’s hard to give concrete advice.
115
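As one concrete illustration of the networking piece, host tuning along the lines of the fasterdata.es.net guidance revolves around a handful of sysctl settings; the values below are only starting points (they depend on RTT, NIC speed, and memory) and should be validated with the unit-tuning approach described above:
$ sysctl -w net.core.rmem_max=67108864
$ sysctl -w net.core.wmem_max=67108864
-  Allow large (here 64 MB) socket buffers
$ sysctl -w net.ipv4.tcp_rmem="4096 87380 33554432"
$ sysctl -w net.ipv4.tcp_wmem="4096 65536 33554432"
-  TCP autotuning limits: min / default / max
$ sysctl -w net.ipv4.tcp_congestion_control=htcp
-  A congestion control algorithm better suited to high-RTT paths, if the htcp module is available
$ sysctl -w net.core.netdev_max_backlog=250000
-  Deeper input queue for 10GE interfaces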
More DTN Info
•  Tuning:
•  http://fasterdata.es.net/science-dmz/DTN/tuning/
•  Reference design:
•  http://fasterdata.es.net/science-dmz/DTN/reference-implementation/
116
Globus Connect Multiuser
for providers
Deliver advanced data management
services to researchers
Provide an integrated user
experience
Reduce your support burden
117
Globus Connect Multiuser
[Diagram: Globus Connect Multiuser bundles a MyProxy Online CA and a GridFTP server in front of a local storage system (HPC cluster, campus server, …) and its local system users]
•  Create endpoint in minutes; no complex GridFTP install
•  Enable all users with local accounts to transfer files
•  Native packages: RPMs and DEBs
•  Also available as part of the Globus Toolkit
118
GCMU Demonstration
119
Hands-on Access
The goal of this session is to show you
(hands-on) how to take a resource and
turn it into a Globus Online endpoint.
Each of you is provided with an Amazon EC2 machine for this tutorial.
Step 1 is to create a Globus Online
account
120
Log into your host
Your slip of paper has the host
information.
Log in as user “sc13”:
ssh [email protected]
Use the password on the slip of paper.
sc13 has passwordless sudo privileges.
121
Let’s get started…
On RPM distributions:
$ yum install http://www.globus.org/ftppub/gt5/5.2/stable/installers/repo/Globus-5.2.stable-config.fedora-18-1.noarch.rpm
$ yum install globus-connect-multiuser
On Debian distributions:
$ curl -LOs http://www.globus.org/ftppub/gt5/5.2/stable/installers/repo/globus-repository-5.2-stable-precise_0.0.3_all.deb
$ sudo dpkg -i globus-repository-5.2-stable-precise_0.0.3_all.deb
$ sudo aptitude update
$ sudo aptitude -y install globus-connect-multiuser
122
GCMU Basic Configuration
•  Edit the GCMU configuration file:
/etc/globus-connect-multiuser.conf
–  Set [Endpoint] Host = ”myendpointname”
•  Run: globus-connect-multiuser-setup
–  Enter your Globus Online username and password
when prompted
•  That’s it! You now have an endpoint ready for
file transfers
•  More configurations at:
support.globusonline.org/forums/22095911
123
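Put together, the minimal edit to /etc/globus-connect-multiuser.conf made on this slide is just the endpoint name; a sketch of the relevant excerpt (the shipped file contains these keys commented out with explanatory text, and later slides in this tutorial also uncomment Public and Sharing in the same file):
[Endpoint]
# Endpoint name to register with Globus Online, as set above
Host = "myendpointname"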
Access endpoint on Globus Online
•  Now go to your account on
www.globusonline.org
•  Go to Start Transfer
•  Access the endpoint you just created
‘<your Globus Online
username>#sc13dtn’
124
Globus Online–GCMU interaction
[Diagram: endpoint activation and transfer flow without OAuth: the user accesses the endpoint on the hosted Globus Online service and supplies a username and password; over a TLS handshake, Globus Online passes these to the GCMU MyProxy Online CA, which authenticates them against the local authentication system (LDAP, RADIUS, Kerberos, …) via PAM and returns a short-term certificate; Globus Online then presents that certificate with the transfer request to the GridFTP server, which authorizes access to local storage and carries out authentication and data transfer with the GridFTP server on the remote cluster or user’s PC]
125
MyProxy OAuth server
•  Web-based endpoint activation
–  Sites run a MyProxy OAuth server
o 
MyProxy OAuth server in Globus Connect Multiuser
–  Users enter username/password only on site’s
webpage to activate an endpoint
–  Globus Online gets short-term X.509 credential via
OAuth protocol
•  MyProxy without OAuth
–  Site passwords flow through Globus Online to site
MyProxy server
–  Globus Online does not store passwords
–  Still a security concern for some sites
126
Globus Connect Multiuser
[Diagram: endpoint activation and transfer flow with OAuth: the user accesses the endpoint on the hosted Globus Online service and is redirected to the site’s OAuth server, where the username and password are entered directly on the site’s web page and checked against the local authentication system (LDAP, RADIUS, Kerberos, …) via PAM; the MyProxy Online CA issues a short-term certificate that is returned to Globus Online through the OAuth protocol; Globus Online then presents the certificate with the transfer request to the GridFTP server, which authorizes access to local storage and carries out authentication and data transfer with the GridFTP server on the remote cluster or user’s PC]
127
Try doing the following
Create a file called tutorial.txt in /home/joe
Go to the GO Web UI -> Start Transfer
Select endpoint username#sc13dtn
Activate the endpoint as user “joe” (not
sc13). You should see joe's home directory.
Transfer between this endpoint and your
Globus Connect or any other endpoint you
have access to.
128
Making endpoint public
Try to access the endpoint created by the
person sitting next to you
You will get the following message:
‘Could not find endpoint with name ‘sc13dtn’ owned by user ‘<neighbor’s username>’’
129
Making endpoint public
Go to your Globus Connect Multiuser server
machine
sudo vi /etc/globus-connect-multiuser.conf
<-- uncomment “Public = False” and replace
“False” with “True”
Now when you access your neighbor’s endpoint, you will not get that message anymore; instead, you will be prompted for a username and password.
But you cannot access it, since you do not have an account on that machine.
130
Enable sharing
sudo vi /etc/globus-connect-multiuser.conf
<-- uncomment Sharing = True
Go to the GO Web UI -> Start Transfer
Select endpoint username#sc13dtn
Share your home directory on this endpoint
with neighbor
Access your neighbor’s endpoint
131
Firewall configuration
•  Allow inbound connections to port 2811
(GridFTP control channel), 7512 (MyProxy CA),
443 (OAuth)
•  Allow inbound connections to ports
50000-51000 (GridFTP data channel)
–  If transfers to/from this machine will happen only from/
to a known set of endpoints (not common), you can
restrict connections to this port range only from those
machines
•  If your firewall restricts outbound connections
–  Allow outbound connections if the source port is in the
range 50000-51000
132
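Expressed as host-based rules on a Linux GCMU server, the policy above might look like this sketch (iptables syntax; your site may implement the same policy on a border device instead, and source restrictions can be added as described above):
$ iptables -A INPUT -p tcp --dport 2811 -j ACCEPT
-  GridFTP control channel
$ iptables -A INPUT -p tcp --dport 7512 -j ACCEPT
-  MyProxy CA
$ iptables -A INPUT -p tcp --dport 443 -j ACCEPT
-  OAuth web interface
$ iptables -A INPUT -p tcp --dport 50000:51000 -j ACCEPT
-  GridFTP data channels
$ iptables -A OUTPUT -p tcp --sport 50000:51000 -j ACCEPT
-  Outbound data connections sourced from the data-channel range, if outbound traffic is otherwise restricted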
Enable your resource. It’s easy.
•  Signup: globusonline.org/signup
•  Connect your system:
globusonline.org/gcmu
•  Learn:
support.globusonline.org/forums/22095911
•  Need help? support.globusonline.org
•  Follow us: @globusonline
133
End of GCMU Basic Tutorial
134
GCMU Advanced Configuration
•  Customizing filesystem access
•  Using host certificates
•  Using CILogon certificates
•  Configuring multiple GridFTP servers
•  Setting up an anonymous endpoint
135
Path Restriction
•  Default configuration:
–  All paths allowed
–  Access control handled by the OS
•  Use RestrictPaths to customize
–  Specifies a comma separated list of full paths that clients may
access
–  Each path may be prefixed by R (read) and/or W (write), or N
(none) to explicitly deny access to a path
–  '~’ for authenticated user’s home directory, and * may be used
for simple wildcard matching.
•  E.g. Full access to home directory, read access to /data:
–  RestrictPaths = RW~,R/data
•  E.g. Full access to home directory, deny hidden files:
–  RestrictPaths = RW~,N~/.*
136
Sharing Path Restriction
•  Define additional restrictions on the paths where your users are allowed to create shared endpoints
•  Use SharingRestrictPaths to customize
–  Same syntax as RestrictPaths
•  E.g. Full access to home directory, deny hidden files:
–  SharingRestrictPaths = RW~,N~/.*
•  E.g. Full access to public folder under home directory:
–  SharingRestrictPaths = RW~/public
•  E.g. Full access to /project, read access to /scratch:
–  SharingRestrictPaths = RW/project,R/scratch
137
Per user sharing control
•  Default: SharingFile = False
–  Sharing is enabled for all users when Sharing = True
•  SharingFile = True
–  Sharing is enabled only for users who have the file
~/.globus_sharing
•  SharingFile can be set to an existing path in order
for sharing to be enabled
–  e.g. enable sharing for any user for which a file exists in /var/
globusonline/sharing/:
–  SharingFile = "/var/globusonline/sharing/$USER”
138
Using a host certificate for GridFTP
•  Comment
–  FetchCredentialFromRelay = True
•  Set
–  CertificateFile = <path_to_host_certificate>
–  KeyFile = <path_to_private_key_associated_with_host_certificate>
–  TrustedCertificateDirectory = <path_to_trust_roots>
139
Use CILogon issued certificates for
authentication
•  Your organization must allow CILogon to
release ePPN attribute in the certificate
•  Set AuthorizationMethod = CILogon in the Globus Connect Multiuser configuration
•  Set CILogonIdentityProvider = <your_institution_as_listed_in_CILogon_identity_provider_list>
•  Add the CILogon CA to your trust roots (/var/lib/globus-connect-multiuser/grid-security/certificates/)
140
Setting up additional GridFTP servers for your endpoint
$ curl -LOs http://www.globus.org/ftppub/gt5/5.2/stable/installers/repo/globus-repository-5.2-stable-oneiric_0.0.3_all.deb
$ sudo dpkg -i globus-repository-5.2-stable-oneiric_0.0.3_all.deb
$ sudo aptitude update
$ sudo aptitude -y install globus-connect-multiuser
$ sudo vi /etc/globus-connect-multiuser.conf
•  Comment ‘Server = %(HOSTNAME)s’ in the [MyProxy] section
•  Copy the contents of ‘/var/lib/globus-connect-multiuser/grid-security/certificates/’ from the first machine to the same location on this machine
$ sudo globus-connect-multiuser-setup
•  Enter your Globus Online username and password when prompted
141
Setting up additional GridFTP servers
for your endpoint
$ curl -LOs http://www.globus.org/ftppub/gt5/5.2/stable/installers/repo/globus-repository-5.2-stable-precise_0.0.3_all.deb
$ sudo dpkg -i globus-repository-5.2-stable-precise_0.0.3_all.deb
$ sudo aptitude update
$ sudo aptitude -y install globus-connect-multiuser
$ sudo vi /etc/globus-connect-multiuser.conf
•  Comment out Server = %(HOSTNAME)s in the [MyProxy] section
•  Copy the contents of ‘/var/lib/globus-connect-multiuser/grid-security/certificates/’ from the first machine to the same location on this machine
$ sudo globus-connect-multiuser-setup
•  Enter your Globus Online username and password when prompted
142
Setting up an anonymous endpoint
$ globus-gridftp-server -aa -anonymous-user <user>
•  -anonymous-user <user> is needed if run as root
$ endpoint-add <name> -p ftp://<host>:<port>
$ endpoint-modify --myproxy-server=myproxy.globusonline.org
143
End of Advanced GCMU Tutorial
144
Diagram: Globus Online services (Dataset Services, Sharing Service, Transfer Service) built on Globus Nexus (Identity, Group, Profile) and the Globus Toolkit
145
Diagram: Globus Platform-as-a-Service (Globus Online APIs, Globus Connect, branded website integration)
146
Integrating with Globus identity/group
management services
147
Portal/application integration
148
Our research is supported by:
U.S. Department of Energy
149
We are a non-profit service
provider to the non-profit
research community
150
We are a non-profit service
provider to the non-profit
research community
Our challenge:
Sustainability
151
Globus Online End User Plans
•  Basic: Free
–  File transfer and synchronization to/from servers
–  Create private and public endpoints
–  Access to shared endpoints created by others
•  Plus: $7/month (or $70/year)
–  Create and manage shared endpoints
–  Peer-to-peer transfer and sharing
152
Globus Online Provider Plans
Bundle of Plus subscriptions
+ Features for providers
http://globusonline.org/provider-plans
(Working on NET+ offering)
153
Move. Sync. Share. It’s easy.
•  Signup: globusonline.org/signup
•  Connect your system:
globusonline.org/globus_connect
•  Learn: support.globusonline.org/forums
•  Need help? support.globusonline.org
•  Follow us: @globusonline
154
NCSA Case Study @ SC2013
Jason Alt
Q: How can we solve our data management needs as we jump to the next scale of computing?
156
Begin By Creating A Wish List
•  Submit and forget requests
•  Automate the transfers
•  Retry failed transfers
•  Load balance across available hardware
•  Let the user know when it’s done
•  Take the burden off the average user
157
How Had We Been Handling This?
•  Handful of TF clusters with a few PB of disk and less than 100Gb of external networking
•  5PB archive with SSH and FTP access
•  Password-less access from cluster to archive
•  Command line based tool (mssftp/msscmd)
•  Automatic retry of transfers until complete
•  Site-specific syntax
•  Single node of operation
158
Are We Satisfying Our Wish List?
•  Submit and forget requests – NO
•  Automated transfers – YES
•  Failure retry – YES
•  Load balancing – NO
•  User notification – NO
It certainly is not cool or fun to use
•  High admin overhead and user frustration
159
Q: Will these tools work with the next scale of computing?
160
Blue Waters: 11-Petaflop System (diagram)
•  28 x Dell R720 IE nodes: 2 x 2.1GHz w/ 8 cores, 1 x 40GbE
•  FDR IB interconnect (136GB/s)
•  36 x Sonexion 6000, Lustre 2.1: > 26.4PB @ > 1TB/s
161
Blue Waters (diagram, continued)
•  HPSS Disk Cache: 4 x DDN 12k, 2.4PB @ 100GB/s (a 400GB/s link is also shown in the original diagram)
•  Mover nodes: 50 x Dell R720, 2 x 2.9GHz w/ 8 cores, 2 x 40GbE (bonded)
•  6 x Spectra Logic T-Finity: 12 robotic arms, 360PB in 95,580 slots, 366 TS1140 Jaguars (85GB/s)
162
Will We Satisfy Our Wish List?
•  Submit and forget requests – NO
•  Automated transfers – YES
•  Failure retry – YES
•  Load balancing – NO (now very necessary!)
•  User notification – NO
Larger data sets mean even more frustration
163
Q: How can we solve our data management needs as we jump to the next scale of computing?
164
Leverage Existing Tools That Work
•  NCSA has a large GSI and GridFTP infrastructure
•  All of our clusters use GridFTP
•  Our legacy mass storage system uses GridFTP
•  All of our users are accustomed to using GridFTP
•  Redesign tools for very large data sets
•  Easy to use
•  Fault tolerant
165
Project Code Name Import/Export
•  We designed an automated transfer mechanism
for Blue Waters
•  Support external and internal transfers
•  Support load balancing, retry on failure
•  Support CLI and eventually web
•  Requirements done, design almost done, almost
time to code and then…
Have you heard of Globus Online?
166 167 168 169
Will We Satisfy Our Wish List?
•  Submit and forget requests – YES
•  Automated transfers – YES
•  Failure retry – YES
•  Load balancing – YES
•  User notification – YES
No coding necessary
170
Partnering With Globus Online
•  No need to create our own solution
•  Leverage solutions to problems from other sites
•  So many new features in the last year or more:
•  Flight control, round robin, increased concurrency
•  Blue Waters GO portal with OAuth
•  And much more in the pipeline
•  Regular expressions, namespace operations,
intelligent staging from tape for archives
171
Q: How can we solve our data management needs as we jump to the next scale of computing?
172
A: Partnering with Globus Online has helped us achieve many of our data management goals.
173
NCSA Use Cases For Globus Online
•  Almost all transfers now use Globus Online
•  Blue Waters and iForge have adopted it as the
supported transfer mechanism
•  Legacy read-only archive uses it as users move
data off before it shuts down
•  LSST project has begun to use it
•  Most transfers originate from the web interface
•  These use cases are really straightforward
174
Blue Waters Lustre Backups
•  2PB of disk backed up nightly
•  Archive files created on Blue Waters and
transferred to HPSS with Globus Online
•  Incremental backups complete in 4-6 hours
•  Written in Python using the GO RESTful API
•  Submits one archive per GO task
•  Makes individual archive completion easy to track
•  Fits with design to parallelize backups
175
Missing Integration With Existing Workflows
•  Power users with existing workflows have been
reluctant to change
•  Want Blue Waters to behave like other clusters
•  Don’t want new steps for file transfer component
•  NCSA set up GRAM and round-robin DNS to mimic GO-like functionality (lots of work!)
•  Use case loses load balancing & fault tolerance
•  More difficult for NCSA to monitor and assist
•  Need adoption from these science teams
176 Acknowledgements
This research is part of the Blue Waters sustained petascale computing project,
which is supported by the National Science Foundation (award number OCI
07-25070) and the state of Illinois. Blue Waters is a joint effort of the
University of Illinois at Urbana-Champaign, its National Center for
Supercomputing Applications, and the Great Lakes Consortium for Petascale
Computation.
177
Recap of afternoon session and wrap-up Q&A
178
Extra slides
179
Lawrence Berkeley National Laboratory
U.S. Department of Energy | Office of Science
Storage Architectures
There are multiple options for DTN storage: local storage (RAID, SSD), a storage system attached via fiber, or a distributed file system over InfiniBand or Ethernet.
180
Storage Subsystem Selection
•  Deciding what storage to use in your DTN comes down to what you are optimizing for: performance, reliability, capacity, and cost.
•  SATA disks historically have been cheaper and higher capacity, while SAS disks typically have been the fastest.
•  However, these technologies have been converging, and with SATA 3.1 this is less true.
181
SSD Storage: on its way to becoming mainstream
•  PCIe Gen3 removes the bottleneck preventing fast I/O
•  New product range: “pro consumer” with interesting
pricing.
•  Larger choice of vendors and packaging (card vs hdd
form factor)
•  Better operating system support (both performance
and reliability)
•  Published third party benchmarks
182
SSD: Current State
•  Performance and reliability varies
•  Mid-range SSD (consumer) is affordable (roughly 10 times more
expensive for 10 times the performance of HDD)
•  Two packaging options: HDD form factor or PCIe card
•  Specific tuning must be done for both reliability and performance.
183
RAID Controllers
•  Often optimized for a given workload, rarely for
performance.
•  RAID0 is the fastest of all RAID levels but is also the
least reliable.
•  The performance of the RAID controller is a function of the number of drives and of its own processing engine.
184
RAID Controllers (2)
Be sure your RAID controller has the following:
•  1GB of on-board cache
•  PCIe Gen3 support
•  dual-core RAID-on-Chip (ROC) processor if you will have more than
8 drives
Examples of RAID cards that satisfy these criteria are the Areca ARC-1882i and the LSI 9271.
When using SSD, make sure that the SSD brand and model are supported by the controller; otherwise some optimizations will not work.
185
Tuning
Defaults are usually not appropriate for performance.
What needs to be tuned:
•  BIOS
•  Firmware
•  Device Drivers
•  Networking
•  File System
•  Application
186
Tuning Methodology
At any given time, one element in the system is preventing
it from going faster.
Step 1: Unit Tuning: focusing on the DTN workflow, experiment and
adjust element by element until reaching the maximum bare metal
performance.
Step 2: Run application (all elements are used) and refine tuning until
reaching goal performance.
187
DTN Tuning
Tuning your DTN host is extremely important. We have seen overall IO
throughput of a DTN more than double with proper tuning.
Tuning can be as much art as science. Due to differences in hardware, it's hard to give concrete advice.
Here are some tuning settings that we have found do make a difference.
•  This tutorial assumes you are running a Red Hat-based Linux system, but other types of Unix should have similar tuning knobs.
Note that you should always use the most recent version of the OS, as
performance optimizations for new hardware are added to every
release.
188
BIOS Tuning
For PCIe Gen3-based hosts, you should
•  enable “turbo boost”,
•  disable hyperthreading and node interleaving.
More information on BIOS tuning is described in this document:
http://www.dellhpcsolutions.com/assets/pdfs/
Optimal_BIOS_HPC_Dell_12G.v1.0.pdf
189
RAID Controller
Different RAID controllers provide different tuning controls. Check the
documentation for your controller and use the settings recommended to optimize
for large file reading.
•  You will usually want to disable any “smart” controller built-in options, as they
are typically designed for different workflows.
Here are some settings that we found increase performance on a 3ware RAID
controller. These settings are in the BIOS, and can be entered by pressing Alt+3
when the system boots up.
•  Write cache – Enabled
•  Read cache – Enabled
•  Continue on Error – Disabled
•  Drive Queuing – Enabled
•  StorSave Profile – Performance
•  Auto-verify – Disabled
•  Rapid RAID recovery – Disabled
190
I/O tuning: vmstat, mpstat.
From man page:
reports information about processes, memory, paging,
block IO, traps, and cpu activity.
•  Shows true I/O operations
•  Shows CPU bottlenecks
•  Shows memory usage
•  Shows locks
$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd     free   buff   cache  si  so   bi   bo  in  cs us sy  id wa st
 0  0      0 22751248 192800 1017000   0   0    4    7   0   0  0  0 100  0  0
191
Simulation & Benchmarking I/O: “dd test”
# dd of=/dev/null if=/storage/data1/test-file1 bs=4k &
# dd of=/dev/null if=/storage/data1/test-file2 bs=4k &
# dd of=/dev/null if=/storage/data2/test-file1 bs=4k &
# dd of=/dev/null if=/storage/data2/test-file2 bs=4k &
# dd of=/dev/null if=/storage/data3/test-file1 bs=4k &
# dd of=/dev/null if=/storage/data3/test-file2 bs=4k &
192
Example vmstat / dd
# vmstat 1
Sample vmstat 1 output while the six dd reads are running: sustained block I/O of roughly 0.5–1.3 GB/s, 13–29% system CPU, and I/O wait of up to ~32%.
193
I/O Scheduler
The default Linux I/O scheduler is the "completely fair queuing" (CFQ) scheduler. For a DTN node, we recommend using the "deadline" scheduler instead.
To enable deadline scheduling, add "elevator=deadline" to the end of the "kernel" line in your /boot/grub/grub.conf file, similar to this:
kernel /vmlinuz-2.6.35.7 ro root=/dev/VolGroup00/LogVol00 rhgb quiet elevator=deadline
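The scheduler can also be changed per device at runtime via sysfs, which is useful for testing before editing grub (the device name is an example):
# show available schedulers and set deadline for /dev/sdb
cat /sys/block/sdb/queue/scheduler
echo deadline > /sys/block/sdb/queue/scheduler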
194
Block Device Tuning: readahead
Increasing the amount of "readahead" usually helps on DTN nodes
where the workflow is mostly sequential reads.
•  However you should test this, as some RAID controllers do this already, and changing this may have adverse effects.
Setting readahead should be done at system boot time. For example, add something like this to /etc/rc.d/rc.local:
/sbin/blockdev --setra 262144 /dev/sdb
More information on readahead:
http://www.kernel.org/doc/ols/2004/ols2004v2-pages-105-116.pdf
195
EXT4 Tuning
The file system should be tuned to the physical layout of the drives.
•  Stride and stripe-width are used to align the volume according to
the stripe-size of the RAID.
−  stride is calculated as Stripe Size / Block Size.
−  stripe-width is calculated as Stride * Number of Disks Providing
Capacity.
Disabling journaling will also improve performance, but reduces
reliability.
Sample mkfs command:
/sbin/mkfs.ext4 /dev/sdb1 -b 4096 -E stride=64,stripe-width=768 -O ^has_journal
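For example (assumed geometry, matching the numbers in the sample command): with a 256 KB RAID stripe size and 4 KB blocks, stride = 256 / 4 = 64; with 12 data-bearing disks, stripe-width = 64 x 12 = 768.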
196
File System Tuning (cont.)
There are also tuning settings that are done at mount time. Here are the ones that we
have found improve DTN performance:
•  data=writeback
−  this option forces ext4 to use journaling only for metadata. This gives a huge
improvement in write performance
•  inode_readahead_blks=64
−  this specifies the number of inode blocks to be read ahead by ext4’s readahead
algorithm. Default is 32.
•  commit=300
−  this parameter tells ext4 to sync its metadata and data every 300s. This reduces the reliability of data writes, but increases performance.
•  noatime,nodiratime
−  these parameters tell ext4 not to write the file and directory access timestamps.
Sample fstab entry:
/dev/sdb1 /storage/data1 ext4 inode_readahead_blks=64,data=writeback,barrier=0,commit=300,noatime,nodiratime
For more info see: http://kernel.org/doc/Documentation/filesystems/ext4.txt
197
SSD Issues
Tuning your SSD is more about reliability and longevity than performance, as each
flash memory cell has a finite lifespan that is determined by the number of
"program and erase (P/E)" cycles. Without proper tuning, SSD can die within
months.
•  never do "write" benchmarks on SSD: this will damage your SSD quickly.
Modern SSD drives and modern OSes should all include TRIM support. Only the newest RAID controllers include TRIM support (late 2012).
Memory Swap
To prolong SSD lifespan, do not swap on an SSD. In Linux you can control this
using the sysctl variable vm.swappiness. E.g.: add this to /etc/sysctl.conf:
−  vm.swappiness=1
This tells the kernel to avoid unmapping mapped pages whenever possible.
Avoid frequent re-writing of files (for example when compiling code from source); use a ramdisk file system (tmpfs) for /tmp, /usr/tmp, etc.
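A small sketch of both suggestions (the size and mount point are examples):
# apply the swappiness setting without a reboot
/sbin/sysctl -w vm.swappiness=1
# example /etc/fstab entry mounting /tmp as a ramdisk
tmpfs /tmp tmpfs defaults,noatime,size=2g 0 0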
198
EXT4 file system tuning for SSD
These mount flags should be used for SSD partitions.
•  noatime: Reading accesses to the file system will no longer result in
an update to the atime information associated with the file. This
eliminates the need for the system to make writes to the file system
for files which are simply being read.
•  discard: This enables the benefits of the TRIM command, as long as the kernel version is >= 2.6.33.
Sample /etc/fstab entry:
/dev/sda1 /home/data ext4 defaults,noatime,discard 0 1
For more information see:
http://wiki.archlinux.org/index.php/Solid_State_Drives
199
Network Tuning
Add to /etc/sysctl.conf:
net.core.rmem_max = 33554432
net.core.wmem_max = 33554432
net.ipv4.tcp_rmem = 4096 87380 33554432
net.ipv4.tcp_wmem = 4096 65536 33554432
net.core.netdev_max_backlog = 250000
# set the default congestion control algorithm to htcp
net.ipv4.tcp_congestion_control=htcp
Add to /etc/rc.local:
# make sure cubic and htcp are loaded
/sbin/modprobe tcp_htcp
/sbin/modprobe tcp_cubic
# increase txqueuelen
/sbin/ifconfig eth2 txqueuelen 10000
/sbin/ifconfig eth3 txqueuelen 10000
# with the Myricom 10G NIC, increasing interrupt coalescing helps a lot:
/usr/sbin/ethtool -C ethN rx-usecs 75
And use Jumbo Frames!
200
NIC Tuning
As the bandwidth of a NIC goes up (10G, 40G), it becomes critical to fine-tune the NIC.
•  Lots of interrupts per second (IRQs)
•  Network protocols require processing data (CRC) and copying data to/from application space.
•  The NIC may interfere with other components such as the RAID controller.
Do not forget to tune TCP and other network parameters as described previously.
201
Interrupt Affinity
•  Interrupt handlers are executed on a core
•  Depending on the scheduler, core 0 gets all the interrupts, or interrupts are dispatched in a round-robin fashion among the cores; both are bad for performance:
•  Core 0 gets all interrupts: with very fast I/O, the core is overwhelmed and becomes a bottleneck
•  Round-robin dispatch: very likely the core that executes the interrupt handler will not have the code in its L1 cache.
•  Two different I/O channels may end up on the same core.
202
A simple solution: interrupt binding
-  Each interrupt is statically bound to a given core (network -> core 1, disk -> core 2)
-  Works well, but can become a headache and does not fully solve the problem: one very fast card can still overwhelm the core.
-  Need to bind the application to the same cores for best optimization: what about multi-threaded applications, for which we want one thread = one core?
Linux by default runs "irqbalance", a daemon that automatically balances interrupts across the cores. It needs to be disabled if you bind interrupts manually.
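A minimal sketch of manual binding (the IRQ number and CPU mask are examples; find your NIC's IRQ in /proc/interrupts):
# stop the automatic balancer
service irqbalance stop
# pin IRQ 73 (assumed NIC interrupt) to core 1 (CPU mask 0x2)
echo 2 > /proc/irq/73/smp_affinity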
203
Case study: Areca 1882 versus LSI 9271
•  Both are PCIe 3.0
•  Both use the same processor (dual-core ROC)
•  Similar hardware design
However, the two cards perform very differently due to their firmware and drivers: while the Areca card seems to have better RAID tuning knobs, it does not allow more than one interrupt to be used for I/O. The LSI can balance its I/O onto several interrupts, effectively dispatching the load onto several cores. Because of this, when using SSD, the LSI card delivers about three times more I/O performance.
204