
Lustre Scalability
Eric Barton
Lead Engineer, Lustre Group
Sun Microsystems
Lustre User Group 2009
Topics
HPC Trends
Architectural Improvements
General Performance Enhancements
HPC Trends
Processor performance / RAM growing faster than I/O
• Relative number of I/O devices must grow to compensate
• Storage component reliability not increasing with capacity
> Failure is not an option – it’s guaranteed
Trend to shared file systems
• Multiple compute clusters
• Direct access from specialized systems
Storage scalability critical
HPC Center of the Future
[Diagram: capability system (500,000 nodes), capacity systems 1–3 (250,000 / 150,000 / 50,000 nodes), test system (25,000 nodes), two visualization clusters, and WAN access, all attached to a 10 TB/sec shared storage network in front of a Lustre storage cluster holding user data and metadata (1000 MDTs, 25 MDS's), backed by an HPSS archive.]
Lustre Scalability
Definition
• Performance / capacity grows nearly linearly with hardware
• Component failure does not have a disproportionate impact on availability
Requirements
• Scalable I/O & MD performance
• Expanded component size/count limits
• Increased robustness to component failure
• Overhead grows sub-linearly with system size
• Timely failure detection & recovery
Lustre Scaling
[Charts: supported client, MDS, and OSS counts by release (Lustre 1.6, 1.8, 2.0, 3.0, future); clients shown on a log scale from 1 to 1,000,000, OSSs from 0 to 3,000.]
Architectural Improvements
Clustered Metadata (CMD)
• 10s – 100s of metadata servers
• Distributed inodes
> Files local to parent directory entry / subdirs may be non-local
• Distributed directories
> Hashing → Striping (see the sketch below)
• Distributed Operation Resilience/Recovery
> Uncommon HPC workload
- Cross-directory rename
> Short term
- Sequenced cross-MDS ops
> Longer term
- Transactional - ACID
- Non-blocking - deeper pipelines
- Hard - cascading aborts, synch ops
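The sketch below shows the basic idea of hashing a directory entry name to choose which MDT holds it. This is a minimal illustration in plain C; the hash function, stripe count, and names are assumptions, not the CMD implementation.

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative FNV-1a hash; the real CMD hash may differ. */
    static uint64_t name_hash(const char *name)
    {
        uint64_t h = 14695981039346656037ULL;
        for (; *name != '\0'; name++) {
            h ^= (unsigned char)*name;
            h *= 1099511628211ULL;
        }
        return h;
    }

    /* Map a directory entry onto one of 'stripe_count' MDTs. */
    static unsigned int mdt_index(const char *name, unsigned int stripe_count)
    {
        return (unsigned int)(name_hash(name) % stripe_count);
    }

    int main(void)
    {
        const char *names[] = { "checkpoint.0001", "checkpoint.0002", "results.dat" };

        for (int i = 0; i < 3; i++)
            printf("%s -> MDT%04u\n", names[i], mdt_index(names[i], 4));
        return 0;
    }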
Epochs
[Diagram: clients and servers each track a local oldest volatile epoch; a reduction network derives the current globally known oldest volatile epoch, separating stable/committed updates from unstable/uncommitted operations that may require redo after a failure.]
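A minimal sketch of the reduction idea, assuming each server and client simply reports the oldest epoch it still holds volatile and the network takes the global minimum; anything older can be treated as stable, anything newer may need redo after a failure. Names and structure are illustrative, not the Lustre epoch protocol.

    #include <stdint.h>
    #include <stdio.h>

    /* Each node reports the oldest epoch it still holds volatile (uncommitted). */
    struct node_state {
        const char *name;
        uint64_t    oldest_volatile_epoch;
    };

    /* The reduction is just a global minimum over all reports. */
    static uint64_t global_oldest_volatile(const struct node_state *nodes, int n)
    {
        uint64_t min = UINT64_MAX;

        for (int i = 0; i < n; i++)
            if (nodes[i].oldest_volatile_epoch < min)
                min = nodes[i].oldest_volatile_epoch;
        return min;
    }

    int main(void)
    {
        struct node_state nodes[] = {
            { "server1", 42 }, { "server2", 40 }, { "client1", 45 },
        };
        uint64_t stable_before = global_oldest_volatile(nodes, 3);

        /* Every update in an epoch below stable_before is globally stable;
         * later epochs may still need redo after a failure. */
        printf("epochs below %llu are stable\n", (unsigned long long)stable_before);
        return 0;
    }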
Architectural Improvements
Fault Detection Today
• RPC timeout
> Timeouts must scale O(n) to distinguish death / congestion
• Pinger
> No aggregation across clients or servers
> O(n) ping overhead
• Routed Networks
> Router failure can be confused with end-to-end peer failure
• Fully automatic failover scales with slowest time constant
> Many 10s of minutes on large clusters
> Failover could be much faster if “useless” waiting eliminated
Architectural Improvements
Scalable Health Network
• Burden of monitoring clients distributed – not replicated
> ORNL – 35,000 clients, 192 OSSs, 7 OSTs/OSS
• Fault-tolerant status reduction/broadcast network
> Servers and LNET routers
• LNET high-priority small message support
> Health network stays responsive
• Prompt, reliable detection
> Time constants in seconds
> Failed servers, clients and routers
> Recovering servers and routers
Interface with existing RAS infrastructure
• Receive and deliver status notification
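A minimal sketch of the status-reduction idea, assuming each monitor aggregates the health of the nodes beneath it and passes only a summary up a tree, so no single node tracks every client. This is an illustration, not the Lustre health network protocol; all names are assumed.

    #include <stdbool.h>
    #include <stdio.h>

    #define MAX_CHILDREN 4

    /* A monitor aggregates the health of the nodes it watches. */
    struct monitor {
        const char     *name;
        bool            alive;                 /* this node's own status */
        struct monitor *child[MAX_CHILDREN];   /* nodes it is monitoring */
        int             nchildren;
    };

    /* Reduce status bottom-up: report the number of dead nodes in the
     * subtree.  Only the summary, not per-client state, travels up. */
    static int count_failed(const struct monitor *m)
    {
        int failed = m->alive ? 0 : 1;

        for (int i = 0; i < m->nchildren; i++)
            failed += count_failed(m->child[i]);
        return failed;
    }

    int main(void)
    {
        struct monitor c1   = { "client1", true,  { 0 }, 0 };
        struct monitor c2   = { "client2", false, { 0 }, 0 };
        struct monitor oss  = { "oss1",    true,  { &c1, &c2 }, 2 };
        struct monitor root = { "mds",     true,  { &oss }, 1 };

        printf("failed nodes in tree: %d\n", count_failed(&root));
        return 0;
    }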
Health Monitoring Network
[Diagram: each client is covered by a primary health monitor, with a failover health monitor as backup.]
Architectural Improvements
Metadata Writeback Cache
• Avoids unnecessary server communications
> Operations logged/cached locally
> Performance of a local file system when uncontended
• Aggregated distributed operations
> Server updates batched and transferred using bulk protocols (RDMA); see the sketch below
> Reduced network and service overhead
Sub-Tree Locking
> Lock aggregation – a single lock protects a whole subtree
> Reduce lock traffic and server load
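A minimal sketch of the batching idea behind the writeback cache, assuming operations are appended to a local log and shipped in one bulk transfer when the log fills or is flushed. The record layout, capacity, and flush policy are assumptions for illustration only.

    #include <stdio.h>

    #define LOG_CAPACITY 64

    /* One locally logged metadata operation (illustrative record format). */
    struct md_op {
        enum { OP_CREATE, OP_UNLINK, OP_SETATTR } type;
        char name[256];
    };

    static struct md_op log_buf[LOG_CAPACITY];
    static int log_count;

    /* Stand-in for one bulk request carrying the whole batch (e.g. via RDMA). */
    static void flush_batch(void)
    {
        if (log_count == 0)
            return;
        printf("sending %d operations in one bulk request\n", log_count);
        log_count = 0;
    }

    /* Log an operation locally; only flush when the batch is full. */
    static void log_create(const char *name)
    {
        if (log_count == LOG_CAPACITY)
            flush_batch();
        log_buf[log_count].type = OP_CREATE;
        snprintf(log_buf[log_count].name, sizeof(log_buf[log_count].name), "%s", name);
        log_count++;
    }

    int main(void)
    {
        for (int i = 0; i < 100; i++) {
            char name[64];
            snprintf(name, sizeof(name), "file.%d", i);
            log_create(name);   /* local speed; no per-op server RPC */
        }
        flush_batch();          /* ship the remainder on close/sync  */
        return 0;
    }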
Architectural Improvements
Current - Flat Communications Model
• Stateful client/server connection required for coherence and performance
• Every client connects to every server
• O(n) lock conflict resolution
Future - Hierarchical Communications Model
• Aggregate connections, locking, I/O, metadata ops
• Caching clients
> Aggregate local processes (cores)
> I/O Forwarders scale another 32x or more
• Caching Proxies
> Aggregate whole clusters
> Implicit Broadcast - scalable conflict resolution
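As a rough worked example with assumed numbers: in the flat model, 1,000,000 clients each connecting to 2,000 servers means roughly 2 billion client/server connections of state system-wide, and revoking a contended lock may mean calling back every interested client individually. With 32-way I/O forwarders each server sees on the order of 31,000 peers, and per-cluster caching proxies reduce that further to one connection and one lock per cluster.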
Hierarchical Communications
[Diagram: user processes reach the Lustre storage cluster (MDSs and OSSs) through layers of aggregation: I/O forwarding clients and servers on the compute side, writeback-cache (WBC) clients, and caching proxy servers fronting whole clusters and WAN/security domains.]
Performance Improvements
Network Request Scheduler (NRS)
• Much larger working set than disk elevator
• Higher level information - client, object, offset, job/rank
Prototype
• Initial development on simulator
• Scheduling strategies - quanta, offset, fairness etc.
• Testing at ORNL pending
Future
• Exchange global information - gang scheduling
• QoS - Real time / Bandwidth reservation (min/max)
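A minimal sketch of one of the scheduling strategies above, assuming per-client request queues served round-robin with a fixed quantum so one busy client cannot starve the others; the actual NRS policies and data structures differ.

    #include <stdio.h>

    #define NCLIENTS 3
    #define QUANTUM  2   /* requests served per client per round (assumed) */

    /* Pending request counts per client queue (stand-in for queued RPCs). */
    static int pending[NCLIENTS] = { 5, 1, 3 };

    int main(void)
    {
        int remaining = 0;

        for (int c = 0; c < NCLIENTS; c++)
            remaining += pending[c];

        /* Serve each client up to QUANTUM requests per round. */
        while (remaining > 0) {
            for (int c = 0; c < NCLIENTS; c++) {
                int served = pending[c] < QUANTUM ? pending[c] : QUANTUM;

                if (served > 0)
                    printf("serving %d request(s) from client %d\n", served, c);
                pending[c] -= served;
                remaining -= served;
            }
        }
        return 0;
    }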
Performance Improvements
SMP Scaling
• Improve MDS performance / small message handling
• CPU affinity
• Finer granularity locking
[Charts: RPC throughput vs. number of nodes and vs. total client processes, by client count.]
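A minimal sketch of the finer-granularity locking idea, assuming incoming requests are hashed onto per-CPU queues, each with its own lock, rather than contending on a single global lock. This is an illustration, not the Lustre SMP code.

    #include <pthread.h>
    #include <stdio.h>

    #define NCPUS 4

    /* One request queue per CPU, each protected by its own lock, so service
     * threads pinned to different CPUs do not contend on one global lock. */
    struct cpu_queue {
        pthread_mutex_t lock;
        int             queued;
    };

    static struct cpu_queue queues[NCPUS];

    /* Hash a request (here just its client id) onto a CPU-local queue. */
    static void enqueue_request(int client_id)
    {
        struct cpu_queue *q = &queues[client_id % NCPUS];

        pthread_mutex_lock(&q->lock);
        q->queued++;
        pthread_mutex_unlock(&q->lock);
    }

    int main(void)
    {
        for (int i = 0; i < NCPUS; i++)
            pthread_mutex_init(&queues[i].lock, NULL);

        for (int client = 0; client < 100; client++)
            enqueue_request(client);

        for (int i = 0; i < NCPUS; i++)
            printf("cpu %d: %d requests queued\n", i, queues[i].queued);
        return 0;
    }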
Performance Improvements
Metadata Protocol
• Size on MDT (SOM)
> Avoid multiple RPCs for attributes derived from OSTs
> OSTs remain definitive while file open
> Compute on close and cache on MDT
• Readdir+
> Aggregation
- Directory I/O
- Getattrs
- Locking
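A minimal sketch of the size-on-close computation, assuming a RAID-0 style striped layout where the file size and block count are derived from the per-OST object state at close and then cached on the MDT so later stat() calls need no OST RPCs. Field names and layout parameters are assumptions for illustration.

    #include <stdint.h>
    #include <stdio.h>

    /* Per-OST object state reported at close (illustrative fields). */
    struct ost_object {
        uint64_t size;    /* bytes written to this stripe object */
        uint64_t blocks;  /* blocks allocated on this OST        */
    };

    /* Map an object's local size back to a file size for a RAID-0 style
     * striped layout: object i holds stripes i, i+count, i+2*count, ... */
    static uint64_t stripe_end_offset(uint64_t obj_size, unsigned int idx,
                                      uint64_t stripe_size, unsigned int count)
    {
        if (obj_size == 0)
            return 0;
        uint64_t last = obj_size - 1;
        return (last / stripe_size * count + idx) * stripe_size +
               last % stripe_size + 1;
    }

    int main(void)
    {
        const uint64_t stripe_size = 1 << 20;    /* 1 MiB stripes (assumed) */
        struct ost_object objs[2] = {
            { 3 << 20, 6144 },                   /* object 0: 3 MiB */
            { 2 << 20, 4096 },                   /* object 1: 2 MiB */
        };
        uint64_t size = 0, blocks = 0;

        /* On close: derive file size and blocks once, cache them on the MDT. */
        for (unsigned int i = 0; i < 2; i++) {
            uint64_t end = stripe_end_offset(objs[i].size, i, stripe_size, 2);

            if (end > size)
                size = end;
            blocks += objs[i].blocks;
        }
        printf("cached on MDT: size=%llu blocks=%llu\n",
               (unsigned long long)size, (unsigned long long)blocks);
        return 0;
    }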
Performance Improvements
ZFS
• Remove ldiskfs size limits
• End-to-end data integrity
• Hybrid storage
Channel bonding
• Combine multiple Network Interfaces
• Failover
• Capacity
Performance Improvements
Rebuild performance
• Frequent disk failures
> Rebuild quickly to prevent data loss on next failure
• Disk group remains in operation during rebuild
> Avoid using OST during rebuild
- Speed rebuild
- Amdahl’s law
• ZFS rebuild improvements
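As a rough illustration with assumed numbers: rebuilding a 1 TB disk at a dedicated 100 MB/s takes about 3 hours, but if foreground I/O on the OST leaves only 10 MB/s for the rebuild it stretches to more than a day, during which a second failure in the same disk group can mean data loss.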
Lustre Scalability
Attribute             Today                        Future
Number of Clients     10,000s (flat comms model)   1,000,000s (hierarchical comms model)
Server Capacity       Ext3 – 8 TB                  ZFS – Petabytes
Metadata Performance  Single MDS                   Single MDS improvements, CMD
Recovery Time         RPC timeout – O(n)           Health Network – O(log n)
THANK YOU
Eric Barton
[email protected]
[email protected]