
Geoff Quigley, Stephen Childs
and Brian Coghlan
Trinity College Dublin
 e-INIS
 Regional Datastore @TCD
• Recent storage procurement
• Physical infrastructure
• 10Gb networking
• Simple lessons learned
 STEP09 experiences
 Monitoring
• Network (STEP09)
• Storage
 The Irish National e-Infrastructure
 Funds Grid-Ireland Operations Centre
 Creating a National Datastore
• Multiple Regional Datastores
• Ops Centre runs TCD regional datastore
 For all disciplines
• Not just science & technology
 Projects with (inter)national dimension
 Central allocation process
 Grid and non-grid use
 Grid-Ireland @ TCD already had
• Dell Poweredge 2950 (2xQuad Xeon)
• Dell MD1000 (SAS - JBOD)
 After procurement the datastore has
• 8x Dell PE2950 (6x 1TB disks, 10GbE)
• 30x MD1000, each with 15x 1TB disks
 ~11.6 TiB each after RAID6 and XFS format (~350 TiB total)
• 2x Dell Blade Chassis with 8x M600 blades each
• Dell tape library (24x Ultrium 4 tapes)
• HP ExDS9100 with 4 capacity blocks of 82x 1TB disks each and 4 blades
 ~233 TiB total available for NFS/HTTP export

 DPM installed on Dell hardware
• ~100TB for Ops Centre to allocate
• Rest for Irish users via allocation process
• May also try to combine with iRODS
 HP-ExDS high availability store
• iRODS primarily
• NFS exports
• Not for conventional grid use
• Bridge services on blades for community-specific access patterns
 Room needed upgrade
• Another cooler
• UPS maxed out
 New high-current AC circuits added
 2x 3kVA UPS per rack acquired for Dell equipment
 ExDS has 4x 16A 3Ø feeds - 2 on room UPS, 2 raw
 10 GbE to move data!
 Benchmarked with netperf
• http://www.netperf.org
 Initially 1-2Gb/s… not good
 Had machines that produced figures of 4Gb/s+
• What’s the difference?
 Looked at a couple of documents on this:
• http://www.redhat.com/promo/summit/2008/downloads/pdf/Thursday/Mark_Wagner.pdf
• http://docs.sun.com/source/819-0938-13/D_linux.html
 Tested several of these optimisations
• Initially little improvement (~100Mb/s)
• Then identified the most important changes
 Cards fitted to the wrong PCI-E slot
• Were x4 instead of x8
 New kernel version
• New kernel supports MSI-X (multiqueue)
• Was saturating one core; now distributes across cores
 Increased MTU (from 1500 to 9216)
• Large difference to netperf
• Smaller difference to real loads
 Then compared two switches against a direct connection
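The benchmark runs described above can be sketched roughly as follows (a sketch only: `disk01.example.org` is a placeholder hostname, and a `netserver` daemon must already be listening on the far host):

```shell
# 60-second bulk-throughput test between two hosts, as used for the
# switch comparison; netperf reports the result in 10^6 bits/sec.
netperf -H disk01.example.org -l 60 -t TCP_STREAM

# 60-second request/response test; reports transactions per second.
netperf -H disk01.example.org -l 60 -t TCP_RR
```

Running the same pair solo and then simultaneously with a second pair shows whether the switch fabric, rather than the hosts, is the bottleneck.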
[Chart: netperf 60s transfer test, showing repeat results for the Arista switch. Mbits/sec (0-9000) for host pairs A-B and C-D, solo and simultaneous, over a direct connection, the Force10 switch, and the Arista switch (plus an Arista rerun).]

[Chart: netperf 60s TCP request/response test. Requests/sec (0-9000) for the same pairs and configurations over a direct connection, the Force10 switch, and the Arista switch.]



 Storage was mostly in place
 10GbE was there but being tested
• Brought into production early in STEP09
 Useful exercise for us
• See bulk data transfer in conjunction with user access to stored data
• The first large 'real' load on the new equipment

 Grid-Ireland OpsCentre at TCD involved as a Tier-2 site
• Associated with NL Tier-1
 Peak traffic observed during STEP09
 Data transfers into TCD from NL
• Peaked at 440 Mbit/s (capped at 500 Mbit/s)
• Recently upgraded firewall box coped well
[Graphs: HEAnet view of the GEANT link; TCD view of the Grid-Ireland link]
 Lots of analysis jobs
• Running on cluster nodes
• Accessing large datasets directly from storage
• Caused heavy load on network and disk servers
• Caused problems for other jobs accessing storage
• Now known that access patterns were pathological
 Also production jobs
[Graph: storage network traffic annotated with ATLAS production, ATLAS analysis, and LHCb production phases; almost all data stored on one server; 3x 1Gbit bonded links set up]


 Fix to distinguish filesystems with identical names on different servers
 Fixed display of long labels
 Display space token stats in TB
 New code for pool stats
 Pool stats first to use the DPM C API
• Previously everything was done via MySQL
 Was able to merge some of these fixes
• Time-consuming to contribute patches
• Single “maintainer” with no dedicated effort…
 MonAMI useful but future uncertain
• Should UKI contribute effort to plugin development?
• Or should similar functionality be created for “native” Ganglia?

 Recent procurement gave us a huge increase in capacity
 STEP09 was a great test of data paths into and within our new infrastructure
 Identified bottlenecks and tuned configuration
• Back-ported SL5 kernel to support 10GbE on SL4
• Spread data across disk servers for load-balancing
• Increased capacity of cluster-storage link
• Have since upgraded switches
 Monitoring crucial to understanding what’s going on
• Weathermap for quick visual check
• Cacti for detailed information on network traffic
• LEMON and Ganglia for host load, cluster usage, etc.
Thanks for your attention!


 Ganglia monitoring system
• http://ganglia.info/
 Cacti
• http://www.cacti.net/
 Network weathermap
• http://www.network-weathermap.com/
 MonAMI
• http://monami.sourceforge.net/
 Quotas are close to becoming essential for us
 10GbE problems have highlighted that releases on new platforms are needed far more quickly
 Firewall: 1Gb outbound, 10Gb internally
 M8024 switch in ‘bridge’ blade chassis
• 24 port (16 to blades) layer 3 switch
 Force10 switch is the main ‘backbone’
• 10GbE cards in DPM servers
• 10GbE uplink from ‘National Servers’ 6224 switch
 10GbE copper (CX4) from ExDS to M6220 in 2nd blade chassis
• Link between the 2 blade chassis: M6220 - M8024
 4-way LAG Force10 - M8024
 24
port 10Gb switch
 XFP modules
• Dell supplied our XFPs so cost per port reduced
 10Gb/s
only
 Layer 2 switch
 Same Fulcrum ASIC as Arista switch
tested
• Uses a standard reference implementation
 Arista Networks 7124S 24 port switch
 SFP+ modules
• Low cost per port (switches relatively cheap too)
 ‘Open’ software - Linux
• Even has bash available
• Potential for customisation (e.g. iptables being ported)
 Can run 1Gb/s and 10Gb/s simultaneously
• Just plug in the different SFPs
 Layer 2/3
• Some docs refer to layer 3 as a software upgrade
 Our 10GbE cards are Intel PCI-E 10GBASE-SR
 Dell had plugged most into the x4 PCI-E slot
 An error was reported in dmesg
 Trivial solution: moved the cards to x8 slots
 Now get >5Gb/s on some machines
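One quick way to spot this kind of misplacement is to compare the card's supported and negotiated link widths in `lspci -vv` output (`LnkCap` vs `LnkSta`). A minimal sketch; the sample line below is illustrative, and the `sed` pattern assumes lspci's usual "Width xN" wording:

```shell
# On a live system:  lspci -vv -s <bus-id> | grep -E 'LnkCap|LnkSta'
# Here we parse a captured LnkSta line instead.
linksta='LnkSta: Speed 2.5GT/s, Width x4'   # sample capture for illustration
width=$(printf '%s\n' "$linksta" | sed -n 's/.*Width x\([0-9]*\).*/\1/p')
echo "negotiated PCI-E width: x$width"
# An x8-capable card reporting Width x4 is sitting in the wrong slot.
```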
 Maximum Transmission Unit
• Ethernet spec says 1500
• Most hardware/software can support jumbo frames
 ixgbe driver allowed MTU=9216
• Must be set through the whole path
• Different switches have different maximum values
 Makes a big difference to netperf
 Example on SL5 machines, 30s tests:
• MTU=1500, TCP stream at 5399 Mb/s
• MTU=9216, TCP stream at 8009 Mb/s
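The host side of the change can be sketched as below (a sketch only: the interface name `eth2` is a placeholder, 9216 is the ixgbe limit mentioned above, and every switch port in the path must be raised too):

```shell
# Raise the MTU on the 10GbE interface (requires root).
ip link set dev eth2 mtu 9216

# Confirm it took effect.
ip link show dev eth2 | grep -o 'mtu [0-9]*'

# To persist across reboots on Scientific Linux / RHEL, add
#   MTU=9216
# to /etc/sysconfig/network-scripts/ifcfg-eth2
```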
 Machines on SL4 kernels had very poor receive performance (50Mb/s)
 One core was 0% idle
• Use mpstat -P ALL
• Sys/soft used up the whole core
 /proc/interrupts showed PCI-MSI used
 All RX interrupts went to one core
 New kernel had MSI-X and multiqueue
• Interrupts distributed, full RX performance
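A small helper (not from the talk) makes the before/after easy to see: it sums the per-CPU interrupt counts for matching `/proc/interrupts` lines, assuming the usual field layout `IRQ: c0 c1 ... cN  PCI-MSI-X  eth2:vX-Rx`. With plain PCI-MSI one column dominates; with MSI-X the totals spread across cores:

```shell
# irq_spread PATTERN  < /proc/interrupts
# Prints the total interrupt count per CPU for lines matching PATTERN.
irq_spread() {
  grep "$1" | awk '
    { for (i = 2; i <= NF - 2; i++) tot[i] += $i; n = NF - 2 }
    END { for (i = 2; i <= n; i++) printf "cpu%d %d\n", i - 2, tot[i] }'
}

# Typical use on a live system:
#   irq_spread 'eth2:v.*Rx' < /proc/interrupts
```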

-bash-3.1$ grep eth2 /proc/interrupts
114:     247  694613 5597495 1264609    1103    15322  426508 2089709  PCI-MSI-X  eth2:v0-Rx
122:     657 2401390  462620  499858  644629      234 1660625 1098900  PCI-MSI-X  eth2:v1-Rx
130:     220  600108  453070  560354 1937777   128178  468223 3059723  PCI-MSI-X  eth2:v2-Rx
138:      27  764411 1621884 1226975  839601      473  497416 2110542  PCI-MSI-X  eth2:v3-Rx
146:      37  171163  418685  349575 1809175    17262  574859 2744006  PCI-MSI-X  eth2:v4-Rx
154:      27  251647  210168    1889  795228   137892 2018363 2834302  PCI-MSI-X  eth2:v5-Rx
162:      27   85615 2221420  286245  779341      363  415259 1628786  PCI-MSI-X  eth2:v6-Rx
170:      27 1119768 1060578  892101 1312734      813  495187 2266459  PCI-MSI-X  eth2:v7-Rx
178: 1834310  371384  149915  104323   27463 16021786     461 2405659  PCI-MSI-X  eth2:v8-Tx
186:      45       0     158       0       0        1      23       0  PCI-MSI-X  eth2:lsc