Lustre Scalability
Eric Barton, Lead Engineer, Lustre Group, Sun Microsystems
Lustre User Group 2009

Topics
• HPC Trends
• Architectural Improvements
• General Performance Enhancements

HPC Trends
Processor performance / RAM growing faster than I/O
• Relative number of I/O devices must grow to compensate
• Storage component reliability not increasing with capacity
  > Failure is not an option – it's guaranteed
Trend to shared file systems
• Multiple compute clusters
• Direct access from specialized systems
Storage scalability critical

HPC Center of the Future
[Diagram: Capability (500,000 nodes), Capacity 1 (250,000 nodes), Capacity 2 (150,000 nodes), Capacity 3 (50,000 nodes), Test (25,000 nodes), Viz 1, Viz 2 and WAN access all attached to a 10 TB/sec shared storage network; behind it, a Lustre storage cluster holds user data and metadata (1000 MDTs, 25 MDSs), backed by an HPSS archive]

Lustre Scalability
Definition
• Performance / capacity grows nearly linearly with hardware
• Component failure does not have a disproportionate impact on availability
Requirements
• Scalable I/O & MD performance
• Expanded component size/count limits
• Increased robustness to component failure
• Overhead grows sub-linearly with system size
• Timely failure detection & recovery

Lustre Scaling
[Chart: supported client, MDS and OSS counts across Lustre 1.6, 1.8, 2.0, 3.0 and future releases – the client axis runs to 1,000,000 and the OSS axis to 3000]

Architectural Improvements
Clustered Metadata (CMD)
• 10s – 100s of metadata servers
• Distributed inodes
  > Files local to parent directory entry / subdirs may be non-local
• Distributed directories
  > Hashing
  > Striping
• Distributed operation resilience/recovery
  > Uncommon HPC workload – cross-directory rename
  > Short term – sequenced cross-MDS ops
  > Longer term – transactional: ACID, non-blocking, deeper pipelines
  > Hard – cascading aborts, synch ops

Epochs
[Diagram: servers and clients each track a local oldest volatile epoch; a reduction network computes the current globally known oldest volatile epoch, so updates in older epochs are stable and committed while newer ones remain unstable and uncommitted, held in client redo logs]

Architectural Improvements
Fault Detection Today
• RPC timeout
  > Timeouts must scale O(n) to distinguish death / congestion
• Pinger
  > No aggregation across clients or servers
  > O(n) ping overhead
• Routed networks
  > Router failure can be confused with end-to-end peer failure
• Fully automatic failover scales with slowest time constant
  > Many 10s of minutes on large clusters
  > Failover could be much faster if "useless" waiting eliminated

Architectural Improvements
Scalable Health Network
• Burden of monitoring clients distributed – not replicated
  > ORNL – 35,000 clients, 192 OSSs, 7 OSTs/OSS
• Fault-tolerant status reduction/broadcast network
  > Servers and LNET routers
• LNET high-priority small message support
  > Health network stays responsive
• Prompt, reliable detection
  > Time constants in seconds
  > Failed servers, clients and routers
  > Recovering servers and routers
Interface with existing RAS infrastructure
• Receive and deliver status notification

Health Monitoring Network
[Diagram: clients report status to a primary health monitor, with a failover health monitor standing by]
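The health network above is described only at block-diagram level. As a rough illustration of why reducing status up a tree of monitors detects failures in O(log n) hops rather than O(n) individual pings, here is a minimal C sketch (not Lustre source; the fan-out is an assumed value, and the node count is chosen to match the ORNL client figure quoted above):

```c
/*
 * Minimal sketch (not Lustre source): a k-ary health-monitoring tree.
 * Each monitor aggregates the status of its children and reports one
 * summary upward, so a failure is noticed after O(log_k n) hops
 * instead of O(n) individual pings.
 */
#include <stdio.h>
#include <stdbool.h>

#define FANOUT 16       /* children per monitor (assumed)              */
#define NNODES 35000    /* node count, matching the ORNL client figure */

/* One bit of health per node; in practice this would be a status
 * message carried over LNET as a high-priority small message.         */
static bool alive[NNODES];

/* A subtree is healthy only if its root and all children are healthy. */
static bool subtree_healthy(int node)
{
        if (node >= NNODES)
                return true;                    /* past the leaves      */
        if (!alive[node])
                return false;                   /* this node has failed */
        for (int c = 1; c <= FANOUT; c++) {
                long child = (long)node * FANOUT + c;
                if (child < NNODES && !subtree_healthy((int)child))
                        return false;
        }
        return true;
}

int main(void)
{
        for (int i = 0; i < NNODES; i++)
                alive[i] = true;
        alive[12345] = false;                   /* inject one failure   */

        /* Tree depth bounds how long bad news takes to reach the root. */
        int depth = 0;
        for (long span = 1; span < NNODES; span *= FANOUT)
                depth++;

        printf("cluster healthy: %s, tree depth: %d\n",
               subtree_healthy(0) ? "yes" : "no", depth);
        return 0;
}
```

With a fan-out of 16, about 35,000 clients fit in a tree only four levels deep, which is what makes detection time constants measured in seconds plausible.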
Architectural Improvements
Metadata Writeback Cache
• Avoids unnecessary server communications
  > Operations logged/cached locally
  > Performance of a local file system when uncontended
• Aggregated distributed operations
  > Server updates batched and transferred using bulk protocols (RDMA)
  > Reduced network and service overhead
Sub-Tree Locking
  > Lock aggregation – a single lock protects a whole subtree
  > Reduce lock traffic and server load

Architectural Improvements
Current – flat communications model
• Stateful client/server connection required for coherence and performance
• Every client connects to every server
• O(n) lock conflict resolution
Future – hierarchical communications model
• Aggregate connections, locking, I/O, metadata ops
• Caching clients
  > Aggregate local processes (cores)
  > I/O forwarders scale another 32x or more
• Caching proxies
  > Aggregate whole clusters
  > Implicit broadcast – scalable conflict resolution

Hierarchical Communications
[Diagram: user processes make system calls into Lustre clients with writeback-cache (WBC) support; I/O forwarding clients and servers aggregate them, and caching proxy servers aggregate whole clusters across WAN / security domains, all communicating over Lustre protocols with a Lustre storage cluster of MDSs and OSSs]

Performance Improvements
Network Request Scheduler (NRS)
• Much larger working set than a disk elevator
• Higher-level information – client, object, offset, job/rank
Prototype
• Initial development on a simulator
• Scheduling strategies – quanta, offset, fairness, etc.
• Testing at ORNL pending
Future
• Exchange global information – gang scheduling
• QoS – real time / bandwidth reservation (min/max)

Performance Improvements
SMP Scaling
• Improve MDS performance / small message handling
• CPU affinity
• Finer-granularity locking
[Charts: RPC throughput vs. total client processes, for varying numbers of client nodes]

Performance Improvements
Metadata Protocol
• Size on MDT (SOM)
  > Avoid multiple RPCs for attributes derived from OSTs
  > OSTs remain definitive while the file is open
  > Compute on close and cache on the MDT
• Readdir+
  > Aggregation of directory I/O, getattrs and locking

Performance Improvements
ZFS
• Remove ldiskfs size limits
• End-to-end data integrity
• Hybrid storage
Channel bonding
• Combine multiple network interfaces
• Failover
• Capacity

Performance Improvements
Rebuild performance
• Frequent disk failures
  > Rebuild quickly to prevent data loss on the next failure
• Disk group remains in operation during rebuild
  > Avoid using the OST during rebuild – speeds rebuild (Amdahl's law)
• ZFS rebuild improvements

Lustre Scalability

Attribute             | Today                        | Future
Number of clients     | 10,000s (flat comms model)   | 1,000,000s (hierarchical comms model)
Server capacity       | ext3 – 8 TB                  | ZFS – petabytes
Metadata performance  | Single MDS                   | Single MDS improvements; CMD
Recovery time         | RPC timeout – O(n)           | Health network – O(log n)

THANK YOU
Eric Barton
[email protected]
Lustre User Group 2009
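As a back-of-the-envelope companion to the client-count row of the summary table, the hypothetical C sketch below compares the connection state a server must track under the flat model with what it tracks behind I/O forwarders and caching proxies. The aggregation factors and server count are illustrative assumptions (the 32x figure echoes the I/O-forwarder slide; the rest are not Lustre defaults):

```c
/*
 * Back-of-the-envelope sketch (not Lustre code): per-server connection
 * counts under the flat model vs. a hierarchical communications model.
 * All factors below are illustrative assumptions.
 */
#include <stdio.h>

int main(void)
{
        long long clients           = 1000000; /* target client count          */
        long long servers           = 2000;    /* OSSs + MDSs (assumed)        */
        long long fwd_aggregation   = 32;      /* clients per I/O forwarder    */
        long long proxy_aggregation = 64;      /* forwarders per caching proxy */

        /* Flat model: every client connects to every server, so each
         * server tracks one connection per client.                     */
        long long flat_per_server = clients;

        /* Hierarchical model: clients talk to forwarders, forwarders to
         * caching proxies, and only the proxies connect to the servers. */
        long long forwarders      = clients / fwd_aggregation;
        long long proxies         = forwarders / proxy_aggregation;
        long long hier_per_server = proxies;

        printf("flat:         %lld connections per server (%lld total)\n",
               flat_per_server, flat_per_server * servers);
        printf("hierarchical: %lld forwarders, %lld proxies, "
               "%lld connections per server (%lld total)\n",
               forwarders, proxies, hier_per_server,
               hier_per_server * servers);
        return 0;
}
```

The point is only the shape of the numbers: server-visible connections shrink by the product of the aggregation factors, which is what lets the client count grow from 10,000s to 1,000,000s without every server holding a connection per client.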