Lustre Tuning
Intel® Lustre* system and network administration
11.2016
Intel Confidential — Do Not Forward
Lustre Tuning
Lustre has many options for tuning the file system
Some tuning is done at file system creation:
• Creating and defining an external journal
• Setting the stride and stripe-width
Some is performed after file system creation
Tuning is often a trade-off between performance and stability
Tuning is an iterative process:
• Benchmark, tune in one direction, retest, then tune further or back based on the results
• Repeat for the next tuning option
Tuning can get very complex very quickly, so:
• Always start with the "low-hanging fruit"
Linux IO Scheduler
IO scheduler (elevator) comparison
The default kernel elevator (CFQ) is wrong for Lustre
Ensure use of the deadline IO scheduler
• The Lustre-patched kernel already uses the "deadline" scheduler by default:
[root@st02-oss1 ~]# uname -r
2.6.32-358.11.1.el6_lustre.x86_64
[root@st02-oss1 ~]# cat /sys/block/sd*/queue/scheduler
noop anticipatory [deadline] cfq
noop anticipatory [deadline] cfq
noop anticipatory [deadline] cfq
noop anticipatory [deadline] cfq
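On a stock (non-patched) kernel the elevator can be switched at runtime per block device; a minimal sketch, where sdb stands in for each OST disk:
# select the deadline elevator for one device (repeat per disk, or set elevator=deadline on the kernel command line)
echo deadline > /sys/block/sdb/queue/scheduler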
Linux – Other non-Lustre params
Linux Network Stack (just some of the options):
• TCP read / write buffers (default)
  • net.ipv4.tcp_rmem / net.ipv4.tcp_wmem
• TCP read / write buffers (maximum)
  • net.core.rmem_max / net.core.wmem_max
• Queue length (maximum)
  • Receive: net.core.netdev_max_backlog
  • Transmit: txqueuelen
• Flow control, TCP window scaling, etc.
Linux Kernel Memory:
https://www.kernel.org/doc/Documentation/sysctl/vm.txt
Example parameters (check with your network vendors):
• net.ipv4.tcp_timestamps=1
• net.ipv4.tcp_low_latency=1
• net.core.rmem_max=4194304
• net.core.wmem_max=4194304
• net.core.rmem_default=4194304
• net.core.wmem_default=4194304
• net.core.optmem_max=4194304
• net.ipv4.tcp_rmem=4096 87380 4194304
• net.ipv4.tcp_wmem=4096 65536 4194304
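To make such settings persistent, they are typically appended to /etc/sysctl.conf and reloaded; a minimal sketch using one of the values above:
# apply one value immediately, then reload the whole file
sysctl -w net.core.rmem_max=4194304
sysctl -p /etc/sysctl.conf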
Linux – tuned to simplify
tuned is a daemon that monitors the use of system components and dynamically tunes system
settings based on that monitoring information.
Dynamic tuning accounts for the way that various system components are used differently throughout
the uptime for any given system.
To install tuned on RHEL/CentOS 6.x:
yum install tuned-utils.noarch tuned.noarch
To list all the profiles available:
tuned-adm list
To set a profile:
tuned-adm profile throughput-performance
To verify which profile is active:
tuned-adm active
What is in the throughput-performance profile?
set_cpu_governor performance
set_transparent_hugepages always
Scheduler => deadline
kernel.sched_min_granularity_ns = 10000000
kernel.sched_wakeup_granularity_ns = 15000000
vm.dirty_ratio = 40
tuned monitor = off
The CPUfreq governor "performance" sets the CPU statically to the highest frequency within the bounds of scaling_min_freq and scaling_max_freq.
MDS Tuning – IOPS, RAM
Storage with high-IOPS characteristics is best
• High-IOPS storage configured as a RAID 10 array and/or SSD
Take advantage of Linux write-through caching
• The amount of RAM should be greater than the size of the MDT
• Cache all the data from the MDT, or as much of the working set as you can afford
Improved MDS SMP performance in Lustre 2.3
• Lustre's metadata code is still CPU-bound; use high-frequency Intel CPUs
Distributed Namespace allows multiple MDTs in each Lustre file system
• A single metadata target can be a bottleneck
• Remote directory implementation in Lustre 2.4
• Directories striped across MDTs in Lustre 2.7
• Recommended maximum of 4 MDTs per MDS (a balance of increased IO against CPU and memory contention)
OSS Tuning – Throw Hardware At It
OSS backplane
Type and placement of HBAs (NUMA)
Number of CPU cores
• Improved LNET/OSS SMP performance in Lustre 2.3
CPU / thread ratio
OSS bandwidth is often limited by:
• Speed of the network (with modern storage arrays)
• Speed of the backend storage
• Speed of the I/O controller(s) (or PCIe slot)
  • Increasing the number of controllers can help offset this bottleneck
Disks: SATA / Near Line SAS
SATA – Enterprise vs. Consumer Grade
• Be aware of the many differences – use only Enterprise Grade
NL-SAS is Enterprise SATA with a SAS interface, and more
• Higher-speed interface, longer cabling, better features, etc.
• Similar cost versus SATA effectively makes SATA a bad choice
SATA / NL-SAS – not recommended for the MDT
• MDTs need more IOPS than these disks can provide
• If you insist, an external journal is strongly advised:
# mke2fs -O journal_dev -b 4096 /dev/sdf
# mkfs.lustre --mkfsoptions="-j -J device=/dev/sdf" --ost /dev/sda
When SATA / NL-SAS is used in OSTs for large-block sequential IO:
• Price / performance compared to SAS disks is excellent
Disks: SSD and fast Storage Array
SSD – many options available
• Intel (and other vendors) provide several models specialized for different workloads
• For the MDT, Lustre needs high IOPS for both reads AND writes, plus low latency
• Verify the QoS and the endurance
• "Bigger is better" for endurance and performance
• When using ldiskfs as the MDT backend, use a larger journal (e.g. 2GB) to avoid being IOPS-bound
Fast Storage Array
• Modern, very fast storage arrays need a larger journal (2GB+) than the default
• The design of the chain from the HBA to the storage backplane is really important, especially in NUMA servers
• sgpdd-survey and obdfilter-survey can help identify bottlenecks
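As a sketch of the larger-journal recommendation at MDT format time (the fsname, index, mgsnode and device are illustrative, and exact mkfs.lustre flags vary by Lustre version):
# format an MDT with a 2GB ldiskfs journal
mkfs.lustre --mdt --fsname=lustre --index=0 --mgsnode=192.168.0.22@tcp --mkfsoptions="-J size=2048" /dev/sdb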
Asynchronous Journal on OST
Prior to 1.8, all OST I/O was synchronous
• When the OST sent a commit to the client, all data was on disk
• Required forcing a flush after every bulk write
The option for async journaling was added in 1.8
• Block data is still written synchronously
• OST journal transactions are written asynchronously
• A reduced number of small journal I/Os = better performance
• A single client can push many IOs and get "commits" faster
  • As the journaling entries remain in cache
From 2.x on, the async journal is enabled by default
Tuning LDISKFS backend file systems – OST
Lustre "tries" to aggregate all I/O into 1 MB increments. Writes should be aligned in order to avoid the read-modify-write penalty
• mballoc tries to locate aligned and contiguous disk blocks for I/O. Ext-based file systems try to locate all the blocks of a file within a block group
1MB reads/writes are aligned via stride and stripe-width
• Block size is 4 KB
• stride equals the number of blocks written per disk before writing the next stride to the next disk
• stripe-width equals the number of blocks per stripe
• Note: these two parameters are exposed by mke2fs
Ensure that "stripe-width" equals "stride" times the number of "data" disks in the RAID set
Example: 10-disk RAID6 array (equivalent of 8 data disks, 2 parity) using 4K blocks
• Stride = 1024 KB / 8 disks = 128 KB per disk, and 128 KB / 4 KB per block = 32 blocks
• Stripe-width (in blocks) = 1024 KB / 4 KB = 256
• Check: 256 = 32 x 8, thus we are good to go:
# mkfs.lustre --mkfsoptions="-E stride=32,stripe_width=256" --ost --mgsnode=192.168.0.22@tcp /dev/sda1
OST inodes and LUN settings
Blocks-per-inode ratio
• The EXT* default is to create one (1) inode for every 16 KB
• Appropriate for Enterprise use, but not ideal for typical HPC apps
• Instead, set the blocks-per-inode ratio higher
• 1 inode per 256/512 KB may be more appropriate for large-block sequential IO
• Set with --mkfsoptions=" " when formatting OSTs (see the sketch below)
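A minimal sketch of raising the ratio at format time; the device, mgsnode and the 256 KB value are illustrative (mke2fs -i takes bytes per inode):
# one inode per 256 KB of OST space instead of one per 16 KB
mkfs.lustre --ost --mgsnode=192.168.0.22@tcp --mkfsoptions="-i 262144" /dev/sdc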
Configure LUNs for performance
• Write-through caching
• Read-ahead
• Max sectors / KB
OST Striping
Recall that speed comes from parallel IO
• Accomplished by striping files across OSTs
Striping files across more OSTs
• An "obvious" way to improve performance, typically
Maximum stripe count was originally limited to 160 OSTs
• Restriction caused by the size of the EA in MDT inodes
Lustre 2.2+ supports wide striping (must be enabled at format time)
• Maximum of 2000 OSTs per file
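For context, striping is controlled per file or directory from a client with lfs setstripe; a minimal sketch, where the path and values are placeholders:
# new files under this directory are striped across 8 OSTs with a 1MB stripe size
lfs setstripe -c 8 -S 1M /mnt/lustre/results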
Expanding the file system – indirect tuning
Simple (software-wise) to add more OSTs
Format and mount the new OSTs
Clients automatically learn of and use the new OSTs
• Will support larger stripe counts for files
Ideally, rebalance after mounting
• Any OSTs that were previously nearly full will have faster IO
Software RAID using MDRAID
Software RAID using MDRAID on OSS servers is not advised
Benefits include:
• Lower capital cost to purchase and upgrade
• Higher "potential" maximum performance and fewer hardware components to fail
• Vendor agnostic; mix and match drive types (to a degree)
Downsides include:
• Usually more complicated to manage
• Recovery/rebuild is typically slower/longer
• Uses host CPUs to perform RAID calculations; excessive use of CPU for storage management affects overall IO
If you still insist:
• Use RAID-6 for better reliability
• Ensure that you specify the proper chunk size (with -c) when formatting, as in the sketch below
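A minimal sketch of creating such an array with mdadm; the devices and chunk size are placeholders, and the chunk should match the stride calculation shown earlier:
# 10-disk RAID-6 with a 128 KB chunk (mdadm --chunk is in KB)
mdadm --create /dev/md0 --level=6 --raid-devices=10 --chunk=128 /dev/sd[b-k]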
Consider ZFS as an alternative
Application / Site specific
Lustre tuning
OSS – Service Threads
Three different types of OSS threads
• Used for statfs and object creation:
  • ost_creat (actually ll_ost_creat_XX, where XX is the OST index)
• All other operations (read/write, truncate, setattr, etc.):
  • ost (also ll_ost_XX)
  • ost_io (ditto)
The correct setting depends on:
• Speed of the storage (hardware, running synchronous, etc.)
• Number of OSTs exported from the OSS
• Capacity of the server
• Workload from the clients
OSS – Service Threads
Two ways to manage service threads:
1. Set the initial number of threads when the module loads
• For ost_creat threads:
options oss oss_num_create_threads=8
• For ost and ost_io threads:
options oss oss_num_threads=64
• Minimum initial thread count is 2 (max is 512)
2. Set the max (and min) number of threads
• Lustre starts more service threads as needed. Threads increase up to 4x the minimum, or to the maximum (never > 512)
• Example: set max ost_io threads
# lctl set_param ost.OSS.ost_io.threads_max=128
• Determine how many threads have been started
# lctl get_param ost.OSS.ost_io.threads_started
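The module options above are typically made persistent in a modprobe configuration file; a minimal sketch (the file name is a common convention, not mandated):
# /etc/modprobe.d/lustre.conf
options oss oss_num_threads=64 oss_num_create_threads=8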
OSS Threads – Tuning Guidelines
Increase the number of threads if:
• Several OSTs are exported from one OSS
• The backend storage is running in synchronous mode (commits are not cached)
• I/O completions take excessive time due to slow storage
Decrease the number of threads if:
• The storage is being overwhelmed
• There are "slow I/O" or watchdog messages on clients
• The OSS appears to be resource constrained (e.g. CPU load or RAM utilization is excessively high)
Additional information:
• Thread tuning is applied similarly on the MDS
• MDS/OSS SMP thread affinity is supported in 2.3+
  • Threads are more likely to access "hot" caches
OSS Read Cache – LDISKFS only
Linux provides (for "free") read caching
• Data is cached in unused memory
• However, this data is frequently overwritten, as there is more IO than memory available for caching
Lustre provides an OSS read cache tuneable
• Ideally, we would want to cache all the files (by default)
• However, this is not practical:
  • Not enough memory available
  • Results in too much thrashing inside the cache
• Instead, the setting can be used to cache only the "small" files
  • First, define the max size for a small file
  • Next, instruct the OSS to cache those files:
# lctl conf_param obdfilter.*.readcache_max_filesize=5
MDS – Service Threads
These work similarly to the OSS service threads
MDS threads are:
• mds
• mds_readpage
• mds_setattr
Define the thread count when loading the module:
options mds mds_num_threads=XX
(or) Let Lustre auto-tune the thread count
• Setting min/max and getting the number started applies here too:
lctl {get,set}_param {service}.threads_{min,max,started}
See the Lustre manual for details
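As an illustration of that pattern, assuming the MDS service path mds.MDS.mdt (verify the actual name on your system with lctl list_param mds.*.*):
# cap the main MDS service threads, then check how many are running
lctl set_param mds.MDS.mdt.threads_max=256
lctl get_param mds.MDS.mdt.threads_started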
MDS – Caching
As presented earlier, fully caching OST data is not practical
However, it can be feasible to cache most or all of an MDT
This is accomplished using Linux's native read cache
The only configuration requirement is lots of RAM
Client Caching – Inactive Data
Lustre provides caching similar to Linux
Read the existing setting:
lctl get_param llite.*.max_cached_mb
Tuneable to set the amount of inactive data cached:
lctl set_param llite.*.max_cached_mb=512
• ¾ of RAM is the default setting
Client Caching – Active Data
There are performance gains from caching some dirty data
• Clients performing small IO need not transmit every write
If caching too much dirty data:
• Events that force a cache flush may block the client thread
• Also, more risk if the system experiences an interruption
Dirty cache size is set per OST
• Set in /proc/fs/lustre/osc/<OST name>/max_dirty_mb
• Default is 32MB, max is 1024MB
lctl get_param osc.*.max_dirty_mb
The dirty cache is backed by "granted" space on the OSSs
• Ensures I/O completion
Client Tuning – Read-ahead
Lustre maintains a read-ahead value
• The value starts at 1MB and increments linearly
• The value increases when there are 2 sequential Linux buffer cache misses
• The value increases up to the defined maximum:
/proc/fs/lustre/llite/<fsname>/max_read_ahead_mb
• The default maximum is 40MB
• Read-ahead is disabled when set to 0
# lctl get_param llite.*.max_read_ahead_mb
• Non-sequential IO resets the read-ahead value to 1MB
Client Tuning - Readahead
Read entire “small” file into cache on first access
client# lctl set_param llite.*.max_read_ahead_mb=10
client# lctl set_param llite.*.max_read_ahead_per_file_mb=6
client# lctl set_param llite.*.max_read_ahead_whole_mb=5.5
Cache only files < 6MB on OSS, avoid cache thrashing
oss# lctl set_param obdfilter.*.readcache_max_filesize=6M
Client Tuning - Statahead
Max number of directory entries to be pre-cached when a directory is stat'd
/proc/fs/lustre/llite/<fs-id>/statahead_max
lctl get_param llite.*.statahead_max
Read-only variable showing current status
/proc/fs/lustre/llite/<fs-id>/statahead_status
lctl get_param llite.*.statahead_status
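On clients with heavy directory-scanning workloads the limit can be raised; a minimal sketch (the value is illustrative):
lctl set_param llite.*.statahead_max=128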
Client Tunables - Networking
Goal: keep the network pipe full but not overloaded
Control how much data each client can send
• Maximum number of RPCs that the OST can have pending per client
• Eight (8) is often considered optimal
• May be increased on faster networks
• May be increased with larger OST stripe counts
The maximum number of RPCs for the MDT is 1
# lctl set_param fail_loc=0x804 # to disable
• No recovery is possible (be careful!)
• Only for benchmarking
Client Tunables – Networking
Optimal settings depend on the client / network
Max number of 4K pages per RPC
• Ideally 256 => 1MB per RPC
/proc/fs/lustre/osc/<OST name>/max_pages_per_rpc
lctl get_param osc.*.max_pages_per_rpc
Max RPCs in flight between an OSC and OST
• Range is 1-256
/proc/fs/lustre/osc/<OST name>/max_rpcs_in_flight
lctl set_param osc.*.max_rpcs_in_flight=256
LNet credits and LND peer credits
Client Tunables – All available
How to list tunables without ls /proc
lctl get_param -NF osc.*.*
lctl get_param -NF llite.*.*
Client import state
lctl get_param osc.*.import
Tuning Scenario - #1
Clients writing large, sequential block I/O
• This is Lustre's sweet spot
• Is the system designed, built and functioning properly?
OSS tunables:
• Disable the read cache:
lctl set_param obdfilter.*.read_cache_enable=0
• zone_reclaim_mode = 1 (on NUMA only)
• swappiness = 10
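The last two items are Linux VM sysctls; a minimal sketch of applying them on the OSS:
# prefer reclaiming node-local memory (NUMA) and discourage swapping
sysctl -w vm.zone_reclaim_mode=1
sysctl -w vm.swappiness=10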
Tuning Scenario - #2 (1/2)
Clients writing random or small-block IO – this is the worst-case IO scenario for Lustre
Determine which application is performing this IO
• Use the brw_stats, rpc_stats and extents files from /proc
• Verify using:
# strace -T -ttt -p <pid-of-suspicious-app>
• As (strace + args) will show the IO size
See if the developer will optimize the IO components
Increase the write-back cache
• This allows the per-OST IO to build up on the client
• Fewer I/Os to the servers, less waiting, more processing
• The tunable is /proc/fs/lustre/osc/*/max_dirty_mb
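A minimal sketch of raising it on a client (the value is illustrative; recall the 1024MB per-OST ceiling):
lctl set_param osc.*.max_dirty_mb=256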
Tuning Scenario - #2 (2/2)
OSS: optimally set readcache_max_filesize to cache small files
lctl set_param obdfilter.*.readcache_max_filesize=6M
Clients: optimally set read-ahead
lctl set_param llite.*.max_read_ahead_whole_mb=5.5
lctl set_param llite.*.max_read_ahead_per_file_mb=10
Consider increasing the RPCs in flight
Random IO to one large file? Maximize the stripe count
• The file will get IOPS from all OSTs
Random IO to many small files?
• Set the stripe count to 1
• Spread the distribution of files across many/all OSTs
• This scenario will still have large MDS overhead
If this is the case with many/all apps, configure the OST storage as RAID-10 (more IOPS on RAID-10 versus RAID-6) or use SSDs
Tuning Scenario - #3
Clients waiting on IO
Seen as many entries in the RPCs-in-flight #8 bucket of rpc_stats
lctl get_param osc.*.rpc_stats
If significant, increase the number of RPCs in flight
/proc/fs/lustre/osc/<OST name>/max_rpcs_in_flight
lctl set_param osc.*.max_rpcs_in_flight=<n>
How far?
• Until the highest RPC bucket is much less used
• Check the peers, and also increase the LNet peers and LNet peer credits
• Use an iterative approach when increasing
Tuning Scenario - #4
Backend disk I/O is slow
Seen as applications stalling while waiting for IO
Increase the number of OSS threads
• Initial thread count
• Maximum thread count
How far?
• Again, use the iterative approach
• A good starting point is 32 threads per OST
Tuning Scenario - #5 (1/2)
High-bandwidth / high-latency link
• Possibly a WAN or MAN
• See: http://www.kehlet.cx/articles/99.html
The problem is latency, so increase:
• The client read-ahead cache
/proc/fs/lustre/llite/<fsname>/max_read_ahead_mb
/proc/fs/lustre/llite/<fsname>/max_read_ahead_whole_mb
• The write-behind cache
/proc/fs/lustre/osc/<OST name>/max_dirty_mb
Tuning Scenario - #5 (2/2)
Increase the max RPCs in flight
Increase LNet credits to match the RPCs in flight
Also increase the LND peer_credits
If using o2iblnd:
• Set concurrent_sends manually
• o2iblnd may or may not determine a good value based on peer_credits
If using socklnd:
• Increase the TCP send buffer
• Increase the Tx/Rx window sizes
Tuning Scenario - #6
Intel True Scale InfiniBand card
• Verbs implementation is on-load (CPU-bound)
• Disable Hyper-Threading
• Set the CPU governor to performance
• QIB options:
options ib_qib singleport=1 pcie_caps=0x51 krcvqs=4 rcvhdrcnt=4096
Recommended LNet IB configuration parameters for Intel True Scale InfiniBand:
options ko2iblnd peer_credits=128 peer_credits_hiw=64 credits=1024 \
concurrent_sends=256 ntx=2048 map_on_demand=32 \
fmr_pool_size=2048 fmr_flush_trigger=512 fmr_cache=1
Tuning Scenario - #7
Intel SSD disks
• Verify the configuration with FIO prior to use with Lustre
• Avoid any write cache (the read cache is disabled on Intel SSDs):
hdparm -W0 /dev/sdb
• Increase the journal to 2GB during Lustre's format:
--mkfsoptions="-J size=2048"
• Scheduler: deadline
• Verify endurance:
smartctl -a /dev/sda | grep 233
Legal Information
All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps
Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information
to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No
computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at http://www.intel.com/content/www/us/en/software/intel-solutions-for-lustre-software.html.
Intel technologies may require enabled hardware, specific software, or services activation. Check with your system manufacturer or retailer.
You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to grant Intel a non-exclusive,
royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty
arising from course of performance, course of dealing, or usage in trade.
This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain
the latest forecast, schedule, specifications and roadmaps.
A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS
FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND
EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT
LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN,
MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or
"undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to
change without notice. Do not finalize a design with this information.
Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.
* Other names and brands may be claimed as the property of others.
© 2016 Intel Corporation