Lustre performance monitoring and troubleshooting
March, 2015
Patrick Fitzhenry and Ian Costello
©2012 DataDirect Networks. All Rights Reserved.
1
ddn.com
Agenda
► EXAScaler (Lustre) Monitoring
• NCI test kit hardware details
• What is it? How does it work?
• Demo
► Lustre troubleshooting
• General points
• 4 examples
Introduction
► Patrick Fitzhenry
• Director, Technical Services & Support, South Asia & ANZ
► Ian Costello
• Senior Application Support Engineer
Lustre Performance Monitoring
NCI test kit hardware details
• 20 x Fujitsu compute nodes: dual E5-2670 2.60GHz processors, 32GB RAM
• Single-rail FDR InfiniBand
• SFA12KX-40: 400 x 3TB NL-SAS (data); 12 x 600GB 15K SAS (metadata)
• 4 x OSS: dual E5-2670, 128GB, CentOS 6.4
• 2 x MDS: dual E5-2670, 128GB, CentOS 6.4
Lustre Monitoring Background
► DDN development project
► Uses information from Linux's /proc
► Goals:
• Collect near real-time data (minimum every 1 sec) and visualize it
• All Lustre statistics information is collectable
• Support Lustre 1.8.x, 2.x and beyond
• Application-aware monitoring (job stats)
• Administrators can build custom graphs in the web browser
• Configurable, intuitive dashboard
• Scalable, lightweight, with no performance impact
• Quite helpful for debugging and I/O analysis
► Lustre is a distributed, scalable filesystem. The monitoring/analysis tool must be aware of this.
► A Lustre monitoring tool helps in understanding current and past filesystem behavior and prevents performance slowdown.
ExaScaler Monitoring
► Collects file system, OST pool, and OST/MDT stats, etc.
► Collects job ID, UID/GID, and aggregated application stats, etc.
► Archives data by policy
► Lightweight, near real-time, massive scale, customizable
► Architecture (diagram): a DDN monitoring plugin for collectd runs on each OSS/MDS and Lustre client; collectd's Graphite plugin transfers small UDP(TCP)/IP-based text messages to graphite on the monitoring server.
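The collectd agents in the diagram ship each sample as a small plaintext message over UDP(TCP)/IP. As a rough illustration of the idea, the sketch below formats a metric in Graphite's plaintext protocol (`path value timestamp`) and sends it over UDP; the metric name is hypothetical, not taken from this deck's setup.

```python
import socket
import time

def send_graphite_metric(path, value, host="127.0.0.1", port=2003):
    """Format one metric in Graphite's plaintext protocol and send it over UDP."""
    line = "%s %s %d\n" % (path, value, int(time.time()))
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(line.encode("ascii"), (host, port))
    sock.close()
    return line

# Hypothetical metric name; a carbon/graphite receiver would aggregate these per OST.
msg = send_graphite_metric("collectd.oss1.lustre.ost_stats_write", 4096)
```

Port 2003 is carbon's conventional plaintext port (the commented-out write_graphite block later in this deck uses the same value).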
OpenTSDB Architecture
► The end-to-end OpenTSDB workflow (diagram).
A new Lustre plugin for collectd
► Uses collectd (http://collectd.org)
• Runs on many enterprise/HPC systems
• Written in C for performance and portability
• Includes optimizations and features to handle hundreds of thousands of data sets
• Comes with over 90 plugins ranging from standard cases to very specialized and advanced topics
• Provides powerful networking features and is extensible in numerous ways
• Actively developed, supported, and well documented
► The Lustre plugin extends collectd to collect Lustre statistics while inheriting these advantages
► It is possible to port the Lustre plugin to a better framework if necessary
XML definition of Lustre's /proc information
► Tree-structured descriptions of how to collect statistics from Lustre proc entries
► Modular
• A hierarchical framework comprising a core logic layer (the Lustre plugin) and a statistics definition layer (XML files)
• Extendable without the need to update any source code of the Lustre plugin
• Easy to maintain the stability of the core logic
► Centralized
• A single XML file for all definitions of Lustre data collection
• No need to maintain massive error-prone scripts
• Easy to verify correctness
• Easy to support multiple versions of Lustre and update for new ones
XML definition of Lustre's /proc information (continued)
► Precise
• Strict rules using regular expressions can be configured to filter out all but exactly what we want
• Locations to save collected statistics are explicitly defined and configurable
► Powerful
• Any statistic can be collected as long as there is a proper regular expression to match it
► Extendable
• Any newly wanted statistic can be collected in no time by adding a definition to the XML file
► Efficient
• No matter how many definitions are predefined in the XML file, only the definitions in use are traversed at run-time
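The actual schema lives in the shipped definition file (e.g. /etc/lustre-ieel-2.5_definition.xml, referenced later in this deck). The fragment below is only a hypothetical sketch of the idea described above — a proc subpath, a regular expression, and named fields telling the plugin where to store what it matched; the element names are illustrative, not the plugin's real schema.

```xml
<!-- Hypothetical sketch only; element and field names are illustrative. -->
<entry>
  <subpath>obdfilter/*/stats</subpath>
  <item>
    <name>ost_stats_write</name>
    <pattern>write_bytes +([[:digit:]]+) samples \[bytes\] [[:digit:]]+ [[:digit:]]+ ([[:digit:]]+)</pattern>
    <field>
      <index>1</index>
      <name>sample_count</name>
    </field>
    <field>
      <index>2</index>
      <name>total_bytes</name>
    </field>
  </item>
</entry>
```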
Example of a collectd.conf
This is an example of a /etc/collectd.conf from an MDS (tmds1):
[root@tmds1 ~]# cat /etc/collectd.conf
#
# collectd.conf for DDN LustreMon
#
Interval 5
WriteQueueLimitHigh 1000000
WriteQueueLimitLow 800000
LoadPlugin match_regex
LoadPlugin syslog
<Plugin syslog>
#LogLevel info
LogLevel err
</Plugin>
LoadPlugin lustre
<Plugin "lustre">
<Common>
DefinitionFile "/etc/lustre-ieel-2.5_definition.xml"
</Common>
# OST stats
# <Item>
# Type "ost_kbytestotal"
# Query_interval 300
# </Item>
# <Item>
# Type "ost_kbytesfree"
# Query_interval 300
# </Item>
<Item>
Type "ost_stats_write"
</Item>
<Item>
Type "ost_stats_read"
</Item>
Example of a collectd.conf (continued)
# MDT stats
# <Item>
# Type "mdt_filestotal"
# Query_interval 300
# </Item>
# <Item>
# Type "mdt_filesfree"
# Query_interval 300
# </Item>
<Item>
Type "md_stats_open"
</Item>
<Item>
Type "md_stats_close"
</Item>
<Item>
Type "md_stats_mknod"
</Item>
<Item>
Type "md_stats_unlink"
</Item>
<Item>
Type "md_stats_mkdir"
</Item>
<Item>
Type "md_stats_rmdir"
</Item>
<Item>
Type "md_stats_rename"
</Item>
<Item>
Type "md_stats_getattr"
</Item>
<Item>
Type "md_stats_setattr"
</Item>
<Item>
Type "md_stats_getxattr"
</Item>
<Item>
Type "md_stats_setxattr"
</Item>
<Item>
Type "md_stats_statfs"
</Item>
<Item>
Type "md_stats_sync"
</Item>
Example of a collectd.conf (continued)
<Item>
Type "ost_jobstats"
<Rule>
Field "job_id"
</Rule>
</Item>
<Item>
Type "mdt_jobstats"
<Rule>
Field "job_id"
</Rule>
</Item>
<ItemType>
Type "mdt_jobstats"
<ExtendedParse>
# Parse the field job_id
Field "job_id"
# Match the pattern
Pattern "u([[:digit:]]+)[.]g([[:digit:]]+)[.]j([[:digit:]]+)"
<ExtendedField>
Index 1
Name pbs_job_uid
</ExtendedField>
<ExtendedField>
Index 2
Name pbs_job_gid
</ExtendedField>
<ExtendedField>
Index 3
Name pbs_job_id
</ExtendedField>
</ExtendedParse>
TsdbTags "pbs_job_uid=${extendfield:pbs_job_uid} pbs_job_gid=${extendfield:pbs_job_gid} pbs_job_id=${extendfield:pbs_job_id}"
</ItemType>
<ItemType>
Type "ost_jobstats"
<ExtendedParse>
# Parse the field job_id
Field "job_id"
# Match the pattern
Pattern "u([[:digit:]]+)[.]g([[:digit:]]+)[.]j([[:digit:]]+)"
<ExtendedField>
Index 1
Name pbs_job_uid
</ExtendedField>
Example of a collectd.conf (continued)
<ExtendedField>
Index 2
Name pbs_job_gid
</ExtendedField>
<ExtendedField>
Index 3
Name pbs_job_id
</ExtendedField>
</ExtendedParse>
TsdbTags "pbs_job_uid=${extendfield:pbs_job_uid} pbs_job_gid=${extendfield:pbs_job_gid} pbs_job_id=${extendfield:pbs_job_id}"
</ItemType>
</Plugin>
LoadPlugin "write_tsdb"
<Plugin "write_tsdb">
<Node>
Host "10.10.108.33"
Port "8500"
</Node>
</Plugin>
#loadPlugin "write_graphite"
#<Plugin "write_graphite">
# <Carbon>
# Host "172.21.66.181"
# Port "2003"
# Prefix "collectd."
# Protocol "udp"
# </Carbon>
#</Plugin>
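The `Pattern` in the jobstats items above splits a job_id such as `u<uid>.g<gid>.j<jobid>` into the three OpenTSDB tags. A quick way to sanity-check such a pattern before deploying it is to try it against sample job IDs; the sketch below does so in Python, with `[[:digit:]]` rewritten as `\d` since Python's `re` module does not support POSIX character classes.

```python
import re

# Same structure as the collectd Pattern, with [[:digit:]] rewritten as \d
pattern = re.compile(r"u(\d+)[.]g(\d+)[.]j(\d+)")

m = pattern.match("u1000.g100.j12345")
uid, gid, jobid = m.groups()

# A job_id with the wrong separators should not match at all.
bad = pattern.match("u1000-g100-j12345")
```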
Demo
► Show the OpenTSDB layout
► Show the Grafana layout
► Show adding an MDT-based stat, then update with a filter to a job ID
► Show adding an OST-based stat
Troubleshooting Lustre
Process when Troubleshooting Lustre
Lustre debugging
► Lustre is a complex environment with lots of tightly coupled moving parts:
• Storage (data, metadata)
• OSS
• MDS
• Network
• Lustre server
• Lustre client
• Operating systems
► The software resides in kernel-space, which makes it difficult to debug compared with user-space software.
► It is possible to debug Lustre:
• Lustre bugs do get resolved – search Jira (if the issue is in Lustre).
• A lot of tools have been developed specifically for Lustre debugging.
• The Lustre community is very active and provides strong support.
What to do when a Lustre issue occurs (1)
► Understand the problem
• What is the failure type? (kernel crash/LBUG/system call failure/stuck process/incorrect result/unexpected behavior/performance regression)
• Which nodes cause the problem?
o Is it a server-side or a client-side problem?
o Is it a problem limited to a single client?
o Is it a metadata or data access problem?
• How critical is the problem? The impacted services could be:
o The whole system, e.g. crash or deadlock on MGS/MDS;
o All of the services on a server, e.g. crash or deadlock on OSS;
o A certain service of the whole system, e.g. quota failure on QMT/QSD;
o All of the operations on the client(s), e.g. crash or deadlock on a client.
What to do when a Lustre issue occurs (2)
► Find a simple and reliable reproduction method
• Step 1: Confirm which program causes the bug;
• Step 2: Write a simple program which can reproduce the problem repeatedly;
• Step 3: Simplify the program as much as possible.
• A simple and reliable reproduction method:
o Simplifies the description of the issue, helping other people understand it quickly;
o Reduces the collected logs, reducing the time needed to analyze them;
o Accelerates the confirmation of possible fixes, accelerating the fix process.
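Step 2's "simple program" usually boils down to a handful of system calls plus the errno they return. A hypothetical Python skeleton of such a reproducer (the path and the operation are placeholders for whatever the failing application actually does):

```python
import errno
import os

def reproduce(path):
    """Attempt the suspect operation and report the errno on failure."""
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
        os.write(fd, b"x" * 4096)
        os.close(fd)
        return None  # no failure reproduced
    except OSError as e:
        # errno.errorcode maps the number back to its symbolic name
        return (e.errno, errno.errorcode.get(e.errno))

# Against a path whose parent directory does not exist this reports ENOENT;
# run against the real filesystem it would report the bug's errno instead.
result = reproduce("/nonexistent-dir/lustre-repro-file")
```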
What to do when a Lustre issue occurs (3)
► Collect logs on the involved nodes
• System logs are always valuable for determining the state of Lustre nodes.
• Use the ‘strace’ command to collect logs of system calls:
o Which system call returns failure?
o Which errno does the system call return? The errno is essential for understanding and debugging the issue, e.g. EIO(5) usually means disk I/O has some problem.
• Collect the kernel dump file when a crash happens
o Kdump should always be enabled on a production system.
o It is especially useful for ‘NULL pointer dereference’ crashes.
• Collect Lustre messages for further analysis
• Tips:
o A few lines of critical messages are much more helpful than masses of others.
o The first messages when the bug happens are the most important.
o Massive messages printed days before the bug happens are less valuable.
o Redundant messages are always better than a lack of messages.
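The strace tip above amounts to scanning the trace for calls that returned -1 and reading off the errno name. A small sketch of that scan; the sample lines imitate strace's output format:

```python
import re

# strace prints failed calls as: name(args) = -1 ERRNO (description)
FAIL_RE = re.compile(r"^(\w+)\(.*\)\s+=\s+-1\s+([A-Z]+)\s+\((.*)\)")

def failed_syscalls(trace_lines):
    """Yield (syscall, errno_name, description) for each failed call."""
    for line in trace_lines:
        m = FAIL_RE.match(line)
        if m:
            yield m.groups()

# Illustrative lines in strace's format, not a real capture
sample = [
    'open("/mnt/lustre/f", O_RDONLY) = 3',
    'read(3, 0x7f..., 4096) = -1 EIO (Input/output error)',
]
failures = list(failed_syscalls(sample))
```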
What to do when a Lustre issue occurs (4)
► Collect Lustre messages
• Command: lctl debug_kernel
• Different masks can be used: trace, inode, super, ext2, malloc, cache, info, ioctl, neterror, net, warning, buffs, other, dentry, nettrace, page, dlmtrace, error, emerg, ha, rpctrace, vfstrace, reada, mmap, config, console, quota, sec, lfsck, hsm
• The default masks are “warning, error, emerg, console”, but it might be necessary to change the mask to collect the desired messages.

Mask     | Usage
trace    | Useful for tracing the process flow of the Lustre software stack. Frequently used.
quota    | Useful for debugging quota problems.
dlmtrace | Useful for debugging LDLM problems.
ioctl    | Useful for debugging ioctl problems.
malloc   | Useful for debugging memory-leak problems. Usually used together with leak_finder.pl.
What to do when a Lustre issue occurs (5)
► Fix the issue
• Search whether the same issue has been fixed in the master branch of the Lustre git repository
o The Lustre master branch evolves quickly, which means a lot of fixed bugs might still exist in older versions.
• Search whether any similar issue has been reported
o A fix/workaround might have already proved successful.
• Keep the faith that a fix will show up naturally as soon as the problem is fully understood.
• Compromise if you have to:
o Find a temporary way to recover the service of the production system quickly, e.g. reboot/e2fsck.
o If it is impossible to understand or fix the root cause of the issue right now, try to find a way to work around it.
Real examples of fixing Lustre bugs (1)
► RM-135/LU-4478
• Problem description: When formatting a Lustre OST, the kernel crashes.
• Reproduction steps:
o Apply a debug patch which returns failure from ldiskfs_acct_on()
o Formatting a Lustre OST will trigger the crash
• Collected log: kernel dump file collected by Kdump
• Analysis:
o The log shows that the kernel crashes in ext4_get_sb()/get_sb_bdev()/kill_block_super()/generic_shutdown_super()/iput()/clear_inode() because of ‘BUG: unable to handle kernel NULL pointer dereference at 00000000000001e0’
o Using ‘crash’ commands, it is confirmed that EXT4_SB((inode)->i_sb) is NULL
o Further analysis found that the failure of ldiskfs_acct_on() in ldiskfs_fill_super() is not handled correctly.
• Fix: Add code to handle the failure of ldiskfs_acct_on() in ldiskfs_fill_super(). (http://review.whamcloud.com/10938)
Real examples of fixing Lustre bugs (2)
► RM-185/LU-5054
• Problem description: Creating and setting a pool name of length 16 on a directory succeeds. However, creating a file under that directory fails.
• Reproduction steps:
o [root@penguin1 ~]# lfs setstripe -p aaaaaaaaaaaaaaaa /lustre/dir2
o [root@penguin1 ~]# touch /lustre/dir2/a
touch: cannot touch `/lustre/dir2/a': Argument list too long
• Errno: E2BIG(7)
• Collected log: Lustre trace log to check which function returns the E2BIG errno.
• Analysis: The log shows that lod_generate_and_set_lovea() returns E2BIG, because the pool name inherited from the parent directory is longer than the length limit.
• Fix: Clean up all related code to enforce a consistent length limit on pool names. (http://review.whamcloud.com/10306)
Real examples of fixing Lustre bugs (3)
► LU-5808
• Problem description: When using one MGT to manage two file systems named 'lustre' and 'lustre2T', it is impossible to mount their MDTs on different servers because parsing of the MGS llog fails.
• Reproduction steps:
o mkfs.lustre --mgs --reformat /dev/sdb1;
o mkfs.lustre --fsname lustre --mdt --reformat --mgsnode=192.168.3.122@tcp --index=0 /dev/sdb2;
o mkfs.lustre --fsname lustre2T --mdt --reformat --mgsnode=192.168.3.122@tcp --index=0 /dev/sdb3;
o mount -t lustre /dev/sdb1 /mnt/mgs;
o mount -t lustre /dev/sdb2 /mnt/mdt-lustre;
o mount -t lustre /dev/sdb3 /mnt/mdt-lustre2T;
o lctl conf_param lustre.quota.ost=ug;
o mount -t ldiskfs /dev/sdb1 /mnt/ldiskfs;
o llog_reader /mnt/ldiskfs/CONFIGS/lustre2T-MDT0000 | grep quota.ost;
o The output of the last command is:
#10 (224)marker 8 (flags=0x01, v2.5.25.0) lustre 'quota.ost' Mon Oct 27 21:26:23 2014#11 (088)param 0:lustre 1:quota.ost=ug
#12 (224)marker 8 (flags=0x02, v2.5.25.0) lustre 'quota.ost' Mon Oct 27 21:26:23 2014-
• Collected logs:
o Lustre trace log to check which function returns the failure when mounting the MDTs
o Lustre trace log to check how the MGS handles llog names
• Analysis: The log shows that the MGS matches the llog of ‘lustre2T’ even when it tries to update the llog of ‘lustre’
• Fix: Update the MGS code to match llog names strictly, avoiding invalid records (http://review.whamcloud.com/12437)
Performance Issue during commissioning (1)
Background:
► Lustre System being Commissioned in Asia
► DDN Storage, White box Servers, DDN Lustre
► HW assembled by third party contractor
• No pre- or post-installation documentation
Problem Statement:
► Low OSS Performance
► Failing Performance Acceptance tests
Performance Issue during commissioning (2)
► The local team spent many hours trying to resolve it
► Escalated to the (remote) DDN APAC Lustre support team
► Steps to resolve:
• Determine what the problem is in the first place
o Multiple tests to confirm where the problem is occurring:
– ior and iozone
– obdfilter-survey
– lnet-selftest
– raw IB test utils ib_[write,read]_bw (make sure to specify the correct HCA you want to test)
• Based on the results of the above testing, investigate the hardware
• lspci -vv was our friend
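In this case `lspci -vv` exposed the problem: the HCA had trained at a narrower PCIe link width (`LnkSta`) than the card supports (`LnkCap`). A sketch of that check over captured `lspci -vv` output; the sample text is illustrative, real output contains many more fields:

```python
import re

def link_downgraded(lspci_vv_text):
    """Compare PCIe capability width vs negotiated width from lspci -vv output."""
    cap = re.search(r"LnkCap:.*Width x(\d+)", lspci_vv_text)
    sta = re.search(r"LnkSta:.*Width x(\d+)", lspci_vv_text)
    cap_w, sta_w = int(cap.group(1)), int(sta.group(1))
    return sta_w < cap_w, cap_w, sta_w

# Sample resembling an HCA that trained at fewer lanes than it supports
sample = """
        LnkCap: Port #8, Speed 8GT/s, Width x8, ASPM L0s
        LnkSta: Speed 8GT/s, Width x4, TrErr- Train-
"""
downgraded, cap_w, sta_w = link_downgraded(sample)
```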
Performance Issue during commissioning (3)
► Resolution
• The onsite engineer moved one HCA to an 8-lane PCIe slot on all servers
• Tests were rerun to confirm the fix – which they did, achieving the 10GB/s read/write performance profile.
Performance Issue during commissioning (4)
► 20/20 hindsight is a beautiful thing:
• Obvious once the issue is known
► Lessons learned:
• Detailed installation documentation is needed – the issue would have been resolved easily had it been available
What makes Lustre debugging easier?
Difficulty to debug          | Easy                                      | Middle                                         | Hard
Ability to reproduce         | Every time                                | Sometimes                                      | Never
Time to reproduce            | Seconds                                   | Minutes                                        | Hours
Program to reproduce         | A few system calls                        | Single-node application                        | Parallel application
Condition to reproduce       | A certain condition of a single process   | Race condition with multiple processes         | Uncertain/unknown condition
Involved nodes               | Client                                    | MDS or OSS                                     | Client & MDS & OSS
Involved software components | Single component                          | Multiple components on a single node           | Multiple components on multiple nodes with RPCs
Ways of failing              | Omission failure (crash, request loss, or no reply) | Commission failure (wrong processing of a request, incorrect reply, corrupted state) | Arbitrary/Byzantine failure (unpredictable result)
Types of error               | Syntax error (compile error)              | Semantic defect (unintended result)            | Design deficiency
Problem description          | Clear description with reproduction steps | Clear text description                         | Ambiguous description
Collected logs               | Precise logs since the bug occurred       | Massive unfiltered logs                        | Not enough logs
Fini – Questions?
Lustre debugging
► Lustre is a very complex piece of software which is hard to debug
• It has a lot of software components with tightly coupled interfaces.
• It is a distributed file system with multiple types of nodes connected by a network.
• The software resides in kernel-space, which makes it difficult to debug compared with user-space software.
► It is possible to debug Lustre
• Most Lustre bugs get fixed eventually – search Jira.
• A lot of tools have been developed specifically for Lustre debugging.
• The Lustre community is very active and provides strong support.
Lustre DDN branch: Client Performance Optimization
Where ideas become reality
Genomic Analysis Application
► It's a standardized job set (pipeline), but...
• More than 2000 jobs run in a single pipeline:
o Alignment and mapping against genomic reference databases
o Annotation – adding references (metadata) to data
o Analysis by each application
• There are 100+ analysis applications, but no MPI applications – a lot of single jobs!
• Each application has a lot of options/libraries
• All jobs are dispatched by the job scheduler and allocated very efficiently.
• A lot of analysis pipelines run on the same HPC cluster simultaneously.
Complex, Complex and Complex...
(Diagram: a single pipeline of interdependent jobs – job1 through job306 – with dependencies, "after finish" ordering, and waiting jobs.)
Pipeline-aware I/O performance monitoring
► Developed a Lustre performance monitoring tool
• Near real-time data point collection (every second)
• Any type of I/O monitoring is possible (UID/GID/JOBID or any type of custom ID)
► Performance monitoring is NOT only for daily/hourly reports; it is really critical for performance optimization.
(Screenshot: ExaScaler Monitor showing total throughput broken down by Pipeline1–Pipeline4.)
Problem at MMBK
► A pipeline job on a lustre-2.5 client system takes longer than on a lustre-1.8 client system. One analysis takes 2.5 days!
(Diagram: with lustre-2.5 clients the job finishes 10 hours later than with lustre-1.8 clients.)
Lustre performance optimization for genomic applications
Worked exclusively with Intel to optimize the Lustre 2.5 client code for better I/O performance for genomic applications.
► mmap() I/O performance improvements
• Bug fixes, optimization and improvements
• (BTW, there is a crucial issue with mmap() in GPFS)
► Performance improvements for single shared files
• Parallel reads of the same region of a file from a single client
► CPU/memory resource reduction
• A lot of CPU-intensive applications – CPU usage is always high
► Large bulk I/O size support and enhancement
• Support up to 16MB I/O size (4MB was the limit)
• Aggressive read-ahead engine for large I/O
Fix mmap() performance problems and improvements
► Several applications call mmap() a lot – 10%+ of open() calls come with mmap()! This shows up in the client stats:

# cat /proc/fs/lustre/llite/*/stats
llite.share1-ffff881067f9b800.stats=
snapshot_time          1408263676.546716 secs.usecs
read_bytes             589388 samples [bytes] 0 2147479552 258867698600
write_bytes            1025093126 samples [bytes] 1 4194304 637173439272
osc_read               3880442 samples [bytes] 8 1048576 3667025741928
osc_write              640640 samples [bytes] 5 1048576 637252863026
ioctl                  17938 samples [regs]
open                   90267 samples [regs]
close                  90239 samples [regs]
mmap                   10523 samples [regs]
seek                   6997546 samples [regs]
fsync                  16 samples [regs]
readdir                48874 samples [regs]
setattr                252 samples [regs]
truncate               12 samples [regs]
getattr                2097773 samples [regs]
create                 3465 samples [regs]
link                   1 samples [regs]
unlink                 2890 samples [regs]
statfs                 2069 samples [regs]
alloc_inode            8423 samples [regs]
getxattr               1025105141 samples [regs]
inode_permission       229899278 samples [regs]

► After the rework, a 2.5x speed-up over the 1.8 client.
(Charts: mmap() read performance at 1MB block size, and mmap() read performance improvements across 32K/128K/512K/1024K block sizes, comparing lustre-1.8.9, lustre-2.5.2 and the fixed DDN branch.)
Performance improvements for the same region of a shared file
► The application is not MPI, but a lot of single-process applications refer to one reference database file and do mapping operations against it – many processes on a single client read the same region in parallel.
► Fixes and optimization for parallel read (no cache).
(Chart: read throughput for 4KB and 1MB blocks, single vs parallel, comparing lustre-1.8.9, lustre-2.5.2 and the fixed DDN branch; the fixed branch is 2x–12x faster than lustre-2.5.2.)
► The Sanger Institute in the UK hit similar performance regressions with the lustre-2.5.2 client. After applying our patches, job elapsed time dropped significantly: 24 hours (fixed DDN Lustre branch) vs. 40 hours (lustre-2.5.2).
Optimization of performance under heavy CPU load
► All clients' CPU utilization is quite high, and the job scheduler allocates the next jobs very efficiently.
► Found Lustre 2.5 performance regressions under heavy CPU load.
► A lot of Java applications seem not to do good memory management, and the Lustre client consumes memory.
• Several applications are implemented on old architectural assumptions (assuming everything fits in the cache?)
• The reduced buffer cache available to Lustre led to more disk access rather than cache hits...
Large bulk I/O size support
Monitoring server-side I/O stats shows that a lot of large sequential I/O arrives:
# cat /proc/fs/lustre/obdfilter/*/brw_stats
snapshot_time:  1406696961.271996 (secs.usecs)

                           read            |           write
pages per bulk r/w      rpcs  %  cum %     |     rpcs  %  cum %
1:                   1091416  1      1     |   681741  2      2
2:                     62166  0      1     |   164562  0      2
4:                     96568  0      1     |    60799  0      2
8:                    115945  0      1     |    10054  0      2
16:                   170813  0      1     |    11361  0      2
32:                   242152  0      1     |    18944  0      2
64:                   444827  0      2     |    37609  0      2
128:                  861561  0      3     |   107677  0      3
256:                99436837 96    100     | 32549912 96    100

                           read            |           write
discontiguous pages     rpcs  %  cum %     |     rpcs  %  cum %
0:                 102060933 99     99     | 33641331 99     99
1:                    177850  0     99     |     1196  0     99
2:                     27307  0     99     |       39  0     99
3:                     10447  0     99     |       27  0     99
4:                      5502  0     99     |       16  0     99

                           read            |           write
discontiguous blocks    rpcs  %  cum %     |     rpcs  %  cum %
0:                 102029460 99     99     | 31615681 93     93
1:                    208894  0     99     |  2026762  6     99
2:                     27592  0     99     |      131  0     99
3:                     10511  0     99     |       25  0     99
4:                      5549  0     99     |        9  0     99
- snip -

(Charts: SFA12K/Lustre write and read performance with the large bulk I/O patches, comparing 1MB, 4MB and 16MB I/O on 320 x NL-SAS and 400 x NL-SAS configurations.)
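The brw_stats histogram above shows 96% of RPCs already arriving at the 256-page (1MB) maximum, which is what motivated raising the bulk I/O limit. A small sketch of pulling the "pages per bulk r/w" histogram out of such a dump; the parsing assumes the column layout shown above:

```python
def parse_pages_per_bulk(brw_stats_text):
    """Return {pages: (read_rpcs, write_rpcs)} from a brw_stats dump."""
    hist = {}
    in_section = False
    for line in brw_stats_text.splitlines():
        if "pages per bulk r/w" in line:
            in_section = True  # header row; data rows follow
            continue
        if in_section:
            parts = line.replace("|", " ").split()
            if not parts or not parts[0].rstrip(":").isdigit():
                break  # end of the histogram section
            pages = int(parts[0].rstrip(":"))
            # columns: pages: rpcs % cum% | rpcs % cum%
            hist[pages] = (int(parts[1]), int(parts[4]))
    return hist

# A trimmed sample in the same layout as the dump above
sample = """pages per bulk r/w      rpcs  %  cum %  |     rpcs  %  cum %
128:                  861561  0      3     |   107677  0      3
256:                99436837 96    100     | 32549912 96    100
"""
hist = parse_pages_per_bulk(sample)
```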
Performance results after reworking all improvements (1/3-scale test case)
(Diagram: job timeline comparing Lustre 1.8.9 with the fixed Lustre branch.)
► After the rework, the job finishes 5 hours faster than on lustre-1.8.
Summary
► Learned the I/O patterns of genomic analysis applications.
• Each job's I/O access pattern is not difficult, but the genomic analysis pipeline creates complexity.
► We've done performance monitoring, analysis and optimization of Lustre.
• Real-time Lustre performance monitoring helps performance analysis and performance optimization.
► There are still many areas we can optimize.
• A lot of legacy and old system architecture remains.
• Changing the applications is really hard (researchers are busy, and I/O optimization is not their main work), but adapting and optimizing for their applications is possible.
Troubleshooting
► Two real examples to discuss and illustrate troubleshooting Lustre:
1. A performance issue during commissioning
2. 3 bugs in a mature running system
Generic Grafana graphing
Grafana IOR run
OpenTSDB web interface