Advanced Lustre® Infrastructure Monitoring
(Resolving the Storage I/O Bottleneck and Managing the Beast)
Torben Kling Petersen, PhD
Principal Architect, High Performance Computing

The Challenge

The REAL challenge
• File system
  – Up/down
  – Slow
  – Fragmented
  – Capacity planning
  – HA (fail-overs etc.)
  – Upgrades / patches ??
• Software
  – Bugs
  – Clients
  – Quotas
  – Workload optimization
• Hardware
  – Nodes crashing
  – Components breaking
  – FRUs
  – Disk rebuilds
  – Cables ??
• Other
  – Documentation
  – Scalability
  – Power consumption
  – Maintenance windows
  – Back-ups

The Answer ??
• Tightly integrated solutions
  – Hardware
  – Software
  – Support
• Extensive testing
• Clear roadmaps
• In-depth training
• Even more extensive testing …

ClusterStor Software Stack Overview
• ClusterStor 6000 Embedded Application Server
  – Intel Sandy Bridge CPU, up to 4 DIMM slots
  – FDR & 40GbE front-end, SAS-2 (6G) back-end
  – SBB v2 form factor, PCIe Gen-3
  – Embedded RAID & Lustre support
• Embedded server module (CS 6000 SSU) software stack
  – ClusterStor Manager
  – Lustre File System (2.x)
  – Data Protection Layer (RAID 6 / PD-RAID)
  – Unified System Management (GEM-USM)
  – Linux OS

[Screenshot slides: ClusterStor dashboard with problems found; hardware inventory (two views); finding problems]

But things break … especially disk drives. What then ???

Let's do some math …
• Large systems use many HDDs to deliver both performance and capacity
  – NCSA Blue Waters uses 17,000+ HDDs for the main scratch file system
  – At 3% AFR this means 531 HDDs fail annually
  – That's ~1.5 drives per day !!!!
  – RAID 6 rebuild time under load is 24–36 hours
• Bottom line: the scratch system would NEVER be fully operational, and there would be a constant risk of losing additional drives, leading to data loss !!

Drive Technology/Reliability
● Xyratex pre-tests all drives used in ClusterStor™ solutions
● Each drive is subjected to 24–28 hours of intense I/O
● Reads and writes are performed to all sectors
● Ambient temperature cycles between 40 °C and 5 °C
● Any drive that survives goes on to additional testing
● As a result, Xyratex disk drives deliver proven reliability with less than a 0.3% annual failure rate
● Real-life impact
  ○ On a large system such as NCSA Blue Waters with 17,000+ disk drives, this means a predicted failure of 50 drives per year
  ○ "Other vendors" publicly state a failure rate of 3%*, which (given an equivalent number of disk drives) means 500+ drive failures per year
    ■ With a fairly even distribution, the file system will ALWAYS be in a state of rebuild
    ■ In addition, since a file system with wide stripes performs according to its slowest OST, the entire system will always run in degraded mode …
* DDN, Keith Miller, LUG 2012

Annual Failure Rate of Xyratex Disks
● Actual AFR data (2012/13) experienced by Xyratex-sourced SAS drives
● The Xyratex drive failure rate is less than half of the industry standard!
● At 0.3%, the annual failure count would be 53 HDDs (the sketch below reproduces this arithmetic)
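The arithmetic behind the failure-rate figures above is simple enough to check in a few lines of Python. This is a minimal sketch; the fleet size of 17,700 drives is an assumption, chosen so that a 3% AFR yields the 531 failures per year quoted for the "17,000+ HDDs" Blue Waters example.

    # Sketch of the annual-failure-rate (AFR) arithmetic from the slides above.
    # Assumption: a fleet of 17,700 drives, which makes a 3% AFR come out to
    # ~531 failures/year, matching the "17,000+ HDDs" Blue Waters figures.

    DRIVES = 17_700

    def annual_failures(drive_count: int, afr: float) -> float:
        """Expected number of drive failures per year for a given AFR."""
        return drive_count * afr

    for label, afr in [("industry 3% AFR", 0.03), ("Xyratex 0.3% AFR", 0.003)]:
        per_year = annual_failures(DRIVES, afr)
        per_day = per_year / 365.0
        print(f"{label}: {per_year:.0f} failures/year (~{per_day:.2f}/day)")

    # At ~1.5 failures/day and 24-36 hours per RAID 6 rebuild under load, some
    # array is effectively always rebuilding, i.e. the file system runs degraded.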
Evolution of HDD Technology: Impact on System Rebuild Time
● As growth in areal density slows (<25% per generation), disk drive manufacturers have to increase the number of heads/platters per drive to keep increasing maximum capacity per drive year over year
● 2TB drives today typically include just 5 heads and 3 platters
● 6TB drives in 2014 will include a minimum of 12 heads and 6 platters
● More components will inevitably result in an increase in disk drive failures in the field
● Therefore, systems using 6TB drives must be able to handle the increase in the number of array rebuild events

Why Does HDD Reliability Matter?
● The three key factors you must consider are drive reliability, drive size, and the rebuild rate of your system
  ○ The scary fact is that newer, bigger drives will fail more often
  ○ Drive failures have an even greater impact on file system performance and the risk of data loss with bigger drives such as 6TB or larger !!
  ○ The rebuild window is longer and the risk of data loss is greater
● Traditional RAID technology can take days to rebuild a single failed 6TB drive
● Therefore, parity-declustered RAID rebuild technology is essential for any HPC system

Parity Declustered RAID - Geometry
● PD-RAID geometry for an array is defined as P drives (N+K+A), for example 41 (8+2+2)
● P is the total number of disks in the array
● N is the number of data blocks per stripe
● K is the number of parity blocks per stripe
● A is the number of distributed spare disk drives
(A small arithmetic sketch of this notation appears at the end of the deck.)

Grid RAID Advantage
• Rebuild speed increased by more than 3.5x
• No SSDs, no NV-RAM, no accelerators …
• PD-RAID as it was meant to be …

Thank you …
[email protected]
© Xyratex 2013
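Appendix: PD-RAID Geometry Arithmetic
The sketch below, referenced from the geometry slide, works through the P drives (N+K+A) notation using that slide's 41 (8+2+2) example. The rebuild fan-out comparison is a generic parity-declustered-RAID argument rather than a description of the exact ClusterStor / Grid RAID layout, and the helper function is purely illustrative.

    # Sketch of the "P drives (N+K+A)" notation, using the 41 (8+2+2) example
    # from the geometry slide. The fan-out comparison is a generic PD-RAID
    # argument, not the exact ClusterStor / Grid RAID implementation.

    def pdraid_summary(p: int, n: int, k: int, a: int) -> dict:
        stripe_width = n + k              # blocks per stripe (data + parity)
        data_fraction = n / stripe_width  # usable fraction of each stripe
        # Traditional RAID rebuild: read from the N+K-1 surviving members of
        # one array and write everything to a single dedicated spare.
        classic_rebuild_drives = stripe_width - 1
        # PD-RAID: stripes and the A drives' worth of spare space are spread
        # across all P drives, so a rebuild can read from and write to up to
        # P-1 drives in parallel.
        pdraid_rebuild_drives = p - 1
        return {
            "total drives (P)": p,
            "data blocks per stripe (N)": n,
            "parity blocks per stripe (K)": k,
            "distributed spare drives (A)": a,
            "data fraction of each stripe": round(data_fraction, 3),
            "drives active in a classic rebuild": classic_rebuild_drives,
            "drives active in a PD-RAID rebuild": pdraid_rebuild_drives,
        }

    for key, value in pdraid_summary(p=41, n=8, k=2, a=2).items():
        print(f"{key}: {value}")

Spreading the rebuild work across many more spindles is what makes a speedup of the kind quoted on the Grid RAID slide (more than 3.5x) plausible without SSDs, NV-RAM, or accelerators.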