
Advanced Lustre® Infrastructure Monitoring
(Resolving the Storage I/O Bottleneck
and managing the beast)
Torben Kling Petersen, PhD
Principal Architect
High Performance Computing
The Challenge
2
The REAL challenge
• File system / Software
  – Up/down
  – Slow
  – Fragmented
  – Capacity planning
  – HA (fail-overs etc.)
  – Upgrades / patches ??
  – Bugs
  – Clients
  – Quotas
  – Workload optimization
• Hardware / Other
  – Nodes crashing
  – Components breaking
  – FRUs
  – Disk rebuilds
  – Cables ??
  – Documentation
  – Scalability
  – Power consumption
  – Maintenance windows
  – Back-ups
The Answer ??
• Tightly integrated solutions
– Hardware
– Software
– Support
• Extensive testing
• Clear roadmaps
• In-depth training
• Even more extensive testing …..
ClusterStor Software Stack Overview
ClusterStor 6000 Embedded Application Server:
• Intel Sandy Bridge CPU, up to 4 DIMM slots
• FDR InfiniBand & 40GbE front-end, SAS-2 (6G) back-end
• SBB v2 form factor, PCIe Gen-3
• Embedded RAID & Lustre support
Software stack running on the embedded server modules of the CS 6000 SSU:
• ClusterStor Manager
• Lustre File System (2.x)
• Data Protection Layer (RAID 6 / PD-RAID)
• Linux OS
• Unified System Management (GEM-USM)
ClusterStor dashboard
Problems found
Hardware inventory ….
Finding problems ???
But things break ….
Especially disk drives …
What then ???
Let’s do some math ….
• Large systems use many HDDs to deliver both
performance and capacity
– NCSA BW uses 17,000+ HDDs for the main scratch FS
– At 3% AFR this means 531 HDDs fail annually
– That’s ~1.5 drives per day !!!!
– RAID 6 rebuild time under use is 24 – 36 hours
• Bottom line, the scratch system would NEVER be fully
operational and there would constantly be a risk of
losing additional drives, leading to data loss !! (the arithmetic is sketched below)
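A minimal sketch of the arithmetic above, assuming a drive count of 17,700 (the figure implied by 531 failures at 3% AFR; the slide only says "17,000+") and using the 0.3% AFR quoted later in the deck for comparison:

```python
# Back-of-the-envelope drive-failure arithmetic (illustrative only).
# 17,700 drives is an assumption chosen so that a 3% AFR yields ~531
# failures/year, matching the slide; the deck itself only says "17,000+".

DRIVE_COUNT = 17_700

def failures_per_year(drive_count: int, afr: float) -> float:
    """Expected annual drive failures for a given annual failure rate (AFR)."""
    return drive_count * afr

for label, afr in [("3% AFR (industry)", 0.03), ("0.3% AFR (Xyratex)", 0.003)]:
    per_year = failures_per_year(DRIVE_COUNT, afr)
    print(f"{label}: {per_year:.0f} failures/year, ~{per_year / 365:.1f} per day")

# Output:
#   3% AFR (industry): 531 failures/year, ~1.5 per day
#   0.3% AFR (Xyratex): 53 failures/year, ~0.1 per day
```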
Drive Technology/Reliability
● Xyratex pre-tests all drives used in ClusterStor™ solutions
● Each drive is subjected to 24-28 hours of intense I/O
● Reads and writes are performed to all sectors
● Ambient temperature cycles between 40 °C and 5 °C
● Any drive that survives goes on to additional testing
● As a result, Xyratex disk drives deliver proven reliability with less than a 0.3% annual failure rate
● Real Life Impact
  ○ On a large system such as NCSA Blue Waters with 17,000+ disk drives, this means a predicted failure of ~50 drives per year
  ○ "Other vendors" publicly state a failure rate of 3%*, which (given an equivalent number of disk drives) means 500+ drive failures per year
    ■ With a fairly even distribution, the file system will ALWAYS be in a state of rebuild
    ■ In addition, as a file system with wide stripes performs according to the slowest OST, the entire system will always run in degraded mode …..
*DDN, Keith Miller, LUG 2012
Annual Failure Rate of Xyratex Disks
● Actual AFR data (2012/13) experienced by Xyratex-sourced SAS drives
● The Xyratex drive failure rate is less than half of the industry standard!
● At 0.3%, the annual failure count would be 53 HDDs
Evolution of HDD technology:
Impacts System Rebuild Time
● As growth in areal density slows (<25% per generation), disk drive manufacturers are having to increase the number of heads/platters per drive to continue to increase the maximum capacity per drive year over year
● 2TB drives today typically include just 5 heads and 3 platters
● 6TB drives in 2014 will include a minimum of 12 heads and 6 platters
● More components will inevitably result in an increase in disk drive failures in the field
● Therefore, systems using 6TB drives must be able to handle the increase in the number of array rebuild events
Why Does HDD Reliability Matter?
● The three key factors you must consider are drive reliability, drive size and the rebuild rate of your system
  ○ The scary fact is: new-generation, bigger drives will fail more often
  ○ Such drive failures have an even bigger impact on file system performance and on the risk of data loss when using larger drives such as 6TB or bigger !!
  ○ The rebuild window is longer and the risk of data loss is greater (a rough risk estimate is sketched below)
● Traditional RAID technology can take days to rebuild a single failed 6TB drive
● Therefore Parity De-clustered RAID rebuild technology is essential for any HPC system
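The rebuild-window point can be made concrete with a toy probability model. The sketch below is only an illustration: the 8+2 group width, the assumption of independent failures with a constant (exponential) failure rate, and the 10-hour "fast rebuild" case are assumptions made here, not figures from the deck; only the AFR values and the 24-36 hour traditional rebuild window come from earlier slides.

```python
# Rough, illustrative estimate of how the rebuild window affects the chance
# that additional drives fail while an array is rebuilding.
# Assumptions NOT taken from the slides: a single 8+2 RAID 6 style group
# with 10 surviving drives, independent failures, constant failure rate.
import math

def p_extra_failure(surviving_drives: int, afr: float, window_hours: float) -> float:
    """Probability that at least one surviving drive fails during the rebuild window."""
    hourly_rate = afr / (365 * 24)                      # convert AFR to a per-hour rate
    p_one = 1 - math.exp(-hourly_rate * window_hours)   # one given drive fails in the window
    return 1 - (1 - p_one) ** surviving_drives          # at least one of them fails

for afr in (0.03, 0.003):                               # AFR values quoted in the deck
    for window in (36, 10):                             # traditional vs. assumed faster rebuild (hours)
        p = p_extra_failure(10, afr, window)
        print(f"AFR {afr:.1%}, {window} h rebuild window: "
              f"{p:.3%} chance of a further failure in the same group")
```

Small per-group numbers add up: with hundreds of rebuilds per year across a system of this size, shrinking the window directly shrinks the cumulative exposure to double- and triple-failure events.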
Parity Declustered RAID - Geometry
● PD-RAID geometry for an array is defined as P drives (N+K+A), for example 41 (8+2+2); a toy layout sketch follows below
● P is the total number of disks in the array
● N is the number of data blocks per stripe
● K is the number of parity blocks per stripe
● A is the number of distributed spare disk drives
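To show why this geometry speeds up rebuilds, here is a toy sketch of declustered placement with the 41 (8+2+2) figures above: stripes of width N+K are scattered across all P drives, so the data needed to rebuild a failed drive is read from nearly every survivor rather than from a single N+K-1 partner set. The layout function is a simplified illustration, not the actual ClusterStor/GridRAID placement algorithm.

```python
# Toy illustration of a parity-declustered (PD-RAID) layout with the
# 41 (8+2+2) geometry above.  This is NOT the real ClusterStor/GridRAID
# placement algorithm, just a demonstration of why declustering helps rebuilds.
import random

P, N, K, A = 41, 8, 2, 2            # total drives, data blocks, parity blocks, spares
STRIPE_WIDTH = N + K                # each stripe occupies 10 distinct drives
rng = random.Random(0)

# Scatter many stripes across the P drives, each on a random set of drives.
stripes = [rng.sample(range(P), STRIPE_WIDTH) for _ in range(10_000)]

failed = 7                          # pretend this drive has just failed
affected = [s for s in stripes if failed in s]
# Surviving drives that hold data/parity needed to rebuild the failed drive:
helpers = {d for s in affected for d in s if d != failed}

print(f"Stripes touching drive {failed}: {len(affected)}")
print(f"Survivors participating in the rebuild: {len(helpers)} of {P - 1}")
# With declustering essentially every survivor contributes to the rebuild,
# compared with only N + K - 1 = 9 partner drives in a conventional RAID 6
# group, which is where the large rebuild speed-up comes from.
```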
Grid RAID advantage
• Rebuild speed increased by more than 3.5 x
• No SSDs, no NV-RAM, no accelerators …..
• PD-RAID as it was meant to be …
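For reference, applying that >3.5x speed-up to the 24-36 hour RAID 6 rebuild window quoted earlier would bring a single-drive rebuild down to roughly 7-10 hours; this is a back-of-the-envelope estimate from the numbers in this deck, not a separately measured figure.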
Thank you ….
[email protected]