LSFS: A Lightweight Segment-Structured Local File System to
Boost Parallel File System Performance
Peng Gu, Jun Wang
School of Electrical Engineering and Computer Science
University of Central Florida
{penggu,jwang}@eecs.ucf.edu
Rob Ross, Rajeev Thakur, Sam Lang, Rob Latham
Mathematics and Computer Science Division
Argonne National Laboratory
{rross,thakur,slang,robl}@mcs.anl.gov
Abstract
Parallel I/Os play an increasingly important role in today’s data intensive computing applications.
Many research efforts focus on parallel I/O performance optimizations at the application level, parallel
file system level, or in between, but very few at the local file system or disk level underneath. This lack of
attention to local file systems and disks imposes potential limitations that are otherwise avoidable on the
parallel I/O performance. In this paper, we design and implement a new lightweight, segment-structured
local file system called LSFS to boost the local file I/O performance for parallel file systems. Parallel
Virtual File System (PVFS2) is chosen as an example. Comprehensive experimental results show that
an LSFS-enhanced PVFS2 prototype system can outperform a traditional Linux-Ext3-based PVFS2 for many applications and benchmarks, in some tests by as much as 230% in terms of I/O bandwidth.
Keywords
Parallel I/O, Parallel Virtual File System, LSFS, performance, algorithm
1 Introduction
While CPU speeds have grown rapidly over the last decades, storage systems have lagged far behind. In today's high performance computing arena, clusters consisting of thousands or more CPUs have become the norm. The data about the top 500 supercomputers in the world indicate that most of them are equipped with 512∼10240 CPUs, providing a sustained maximum throughput of 2.7∼280 TFlops [5]. In systems with such enormous computational power, the I/O systems face a severe challenge in feeding those CPUs with a sustained, bulk volume of data.
Parallel I/O systems have come into play to address this demand for high data access rates raised by many scientific applications. There are two schools of thought on building parallel file systems: shared storage architectures and parallel file server architectures. The former model is generally considered a high-end solution, featuring higher I/O system bandwidth but cost prohibitive to most end users; moreover, its lack of scalability, due to the use of centralized shared storage, is an intrinsic deficiency. The latter model (shown in Figure 1) is often less reliable, but more scalable and cost-effective, and thereby suitable for low-end solutions. While both parallel file system models are known for their ability to achieve extremely high I/O bandwidth for large sequential data accesses, small and noncontiguous file I/O performance still remains problematic. This is especially true for parallel file server based architectures, whose fundamental I/O functionalities are performed by general-purpose local file systems without any knowledge of parallel applications.
Figure 1: Parallel File Server Based Architecture
Extensive research has been done to address the parallel I/O performance problem inherent in many parallel I/O systems. For example, the MPI-IO [20] library, which provides a standard way to describe both structured and unstructured I/O access patterns efficiently using an MPI-style interface and has become the de facto parallel I/O programming interface, is constantly being optimized with advanced features such as data sieving I/O [19], collective I/O [19] and list I/O [20]. Various optimization techniques are also advocated at the application level, data description library level, and parallel file system level, but very few at the local file system or disk level.
This lack of attention to local file systems and disks results in two limitations in current parallel I/O system solutions.
• In traditional local file systems, it is well known that even logically contiguous file I/O may become noncontiguous physical I/O at the disk level due to file/disk fragmentation. To make the situation worse, research [13] shows that parallel scientific workloads often involve 1-D or 2-D strided access patterns, and such access patterns immediately translate into noncontiguous file I/O requests to the file system, destroying the logical file access spatial locality commonly observed in traditional applications. These noncontiguous logical file I/O requests normally result in noncontiguous physical disk I/O requests. As a result, most parallel I/O systems are experiencing increasingly intensive noncontiguous disk I/O traffic. Without special optimizations, the clients inevitably experience an elongated average I/O response time as a result of the increased number of noncontiguous physical I/O requests.
• I/O response time may be reduced by means of various prefetching algorithms. However, the noncontiguous disk block access patterns present in most parallel applications can degrade the effectiveness of prefetching, even with an optimal prefetching algorithm. The reason is clear: for traditional applications with sequential access patterns, when sufficient prefetching accuracy is obtained, prefetching contiguous disk blocks not only reduces wait time but also makes better use of disk bandwidth by issuing larger requests. Prefetching noncontiguous disk blocks, however, does not improve disk I/O bandwidth utilization because of the dispersed small disk requests seen from the I/O system's perspective.
In this paper, we design and implement a lightweight, segment-structured local file system called LSFS to boost parallel I/O performance by addressing the aforementioned problems in parallel file server based architectures. Unlike previous parallel I/O enhancement techniques, this optimization is done between the parallel file system level and the local file system level; in other words, it is more native from the storage system's perspective. At the same time, existing local file systems are typically not optimized for parallel I/O applications. LSFS distinguishes itself from other solutions by the following design choices: i) a successor-relationship-based data grouping algorithm, named AEA (Section 2.4); ii) a disk-segment-based prefetching technique; iii) a lightweight file management technique.
By running several representative parallel I/O benchmarks and applications, including noncontig, IOR and mpi-tile-io, on two Linux cluster testbeds, we observed that an LSFS-enhanced PVFS2 prototype system can improve I/O performance for many parallel I/O applications and benchmarks, in some cases by as much as 230% in terms of I/O bandwidth.
2 Locality-based Grouping in LSFS
The main idea of LSFS is to group noncontiguous file I/O requests on a segment-based disk partition, catering to the needs of the higher-level parallel file system. The design goal of LSFS is to reduce the number of I/Os by exploiting the access locality among LSFS files. In our current prototype system, LSFS functions as a read-only cache in front of the local file system. Since we use a modular design, the functionality of LSFS can be easily extended in the future when desired.
2.1 Two levels of Grouping in LSFS
We notice that there are two levels of temporal access locality in existing parallel applications: inter-file
access locality and intra-file access locality.
2.1.1 Inter-file level access locality
Small-file access patterns exhibit access locality in almost all kinds of applications. For example, when we compile a program, the same set of header files may be included many times by different source files. Similarly, when we write a paper using LaTeX, every time the work is compiled, the same set of packages is included, and therefore the same set of files is accessed. This implies that grouping multiple small files together could facilitate large I/O operations: the entire group of files is fetched when any one of the group members is accessed. We observe that such patterns exist in current scientific computing [13, 23] as well.
2.1.2 Intra-file level access locality
There are also times when files are huge; this is especially true for scientific applications. Many computation nodes may concurrently read the same large data file, e.g., a 3-D object database, but in different portions. Researchers have summarized several representative access patterns in today's scientific computing, such as simple strided, nested strided, random strided, sequential, segmented, tiled, Flash I/O, and unstructured mesh accesses [17]. For example, Figure 2 illustrates a nested strided access pattern resulting from column-based access to a 2-D matrix. In this example, the innate access locality between adjacent column elements disappears at both the virtual file level and the local file level. Herein lies the problem: parallel file systems may understand these access patterns but have no control over the on-disk data organization, while local file systems have some control over the data layout but are blind to the parallel I/O access patterns. LSFS, on the other hand, designed as a modular plug-in for parallel file systems, is able to observe the parallel I/O access pattern and control its own local disk layout
at the same time. It can thus bridge the gap between parallel file systems and local file systems by grouping file portions with strong access locality together and putting them into adjacent disk blocks.

Figure 2: Strided Access Pattern (column-based access to a 2-D matrix: the data being accessed in the matrix, its virtual file/memory view, and its local file/disk view on I/O server 1 and I/O server 2)
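To make the pattern concrete, the short Python sketch below (our own illustration, with an assumed 6x6 matrix of 8-byte elements rather than any parameters from the paper) shows how reading one column of a row-major 2-D matrix turns a logically simple access into a strided sequence of noncontiguous file offsets.

# Illustration only: offsets generated by reading one column of a row-major
# 2-D matrix. The matrix shape and element size are assumed for the example.
rows, cols, elem_size = 6, 6, 8
column = 2
offsets = [(r * cols + column) * elem_size for r in range(rows)]
print(offsets)  # [16, 64, 112, 160, 208, 256]: a fixed stride of cols * elem_size bytes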
2.2 State-of-the-art in Prefetching and Grouping
In order to track down the access relationship between a group of small files or discrete portions of a large
file, we develop an efficient, accurate, and adaptive grouping and prefetching algorithm based on online data access successor relationship discovery techniques.
Previous prefetching work at the disk level and at the file level can be classified into three categories:
predictive prefetching [12], application-controlled prefetching [16], and compiler-directed prefetching [18]. In
this taxonomy, LSFS falls into the predictive prefetching category.
Among the latest advancements in the area of predictive prefetching, Long et al. introduced several file
access predictors including First Successor, Last Successor, Noah (Stable Successor) [6], Recent Popularity
(also known as Best j-out-of-k) and Probability-based Successor Group Prediction [22, 7]. The differences
among these predictors are summarized as follows.
First Successor The file that followed file A the first time A was accessed is always predicted to follow A.

Last Successor The file that followed file A the last time A was accessed is predicted to follow A.

Noah (Stable Successor) Similar to Last Successor, except that a current prediction is maintained, and the current prediction is changed to the last successor only if the last successor was the same for S consecutive accesses, where S is a predefined parameter.

Recent Popularity (Best j-out-of-k) Based on the last k observations of file A's successors, if j out of those k observations target the same file B, then B is predicted to follow A.

Probability-based Successor Group Prediction Based on file successor observations, a file relationship graph is built to represent the probability of a given file following another. Based on the relationship graph, the prefetch strategy builds the prefetching group in the following steps:
1. The missed item is first added into the group.
2. Add the item with the highest conditional probability given that the items in the current prefetching group were accessed together.
3. Repeat step 2 until the group size limitation is met.
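As an illustration of these steps, the following Python sketch builds a prefetching group greedily from a relationship graph. It is a simplified reading of the scheme, not the original implementation: the graph is given as pairwise conditional probabilities, and conditioning on the whole current group is approximated by averaging over its members.

def build_prefetch_group(missed, prob, max_size):
    """prob[a][b]: estimated probability that file b follows file a."""
    group = [missed]                              # step 1: start from the missed item
    while len(group) < max_size:                  # step 3: stop at the group size limit
        scores = {}
        for member in group:                      # step 2: score candidates against the group
            for succ, p in prob.get(member, {}).items():
                if succ not in group:
                    scores[succ] = scores.get(succ, 0.0) + p / len(group)
        if not scores:
            break
        group.append(max(scores, key=scores.get))
    return group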
It can be observed that all but the last one of the aforementioned prefetching methods are designed to
predict only one successor per access. In addition to this work, group prefetching algorithms have also been studied by different research groups for various applications.
In 2006, Memik et al. proposed a new programming model called Multicollective I/O that tries to access a group of unique files with a single MPI-IO-like request [14]. In concept, it extends collective I/O to allow data from multiple files to be requested in a single I/O request, in contrast to allowing only multiple segments of a single file to be specified together. To do so, the authors propose two different heuristics to detect the access pattern. However, their work was done at the MPI-IO library level rather than in the parallel or local file system, and it discovers only inter-file access locality.
2.3 Grouping algorithm considerations
Unlike the aforementioned data grouping algorithms, our locality-based grouping algorithm is capable of
discovering group access locality at both file level and disk level. To achieve this goal, in LSFS design, both
small files (inter-file access locality) and large file portions (intra-file access locality) are treated the same
way as individual LSFS files. These LSFS files are then represented as nodes in a graph, which maintains the
group access relationship among the individual LSFS files. It should be noted that our grouping algorithm
is not intended for prefetching continuous data in a large file, rather, it is designed to improve the I/O
performance for accessing small files and noncontinuous portions of a large file by grouping them together
for future prefetching.
To develop a grouping algorithm for our special purpose, there are several specific issues to bear in mind.
Accuracy First of all, the grouping algorithm has to be accurate. Data prefetching based on inaccurate grouping information can easily introduce overhead significant enough to cancel out its advantage. The possible overhead of mis-grouping is at least twofold: wasted disk bandwidth and pollution of the LSFS group prefetching cache. As a result, accuracy is always the first priority throughout our grouping algorithm design.

Efficiency Secondly, the grouping algorithm must be efficient. We are designing an online grouping and prefetching algorithm rather than an off-line one, and an online system has to be efficient in terms of CPU cycles for practical use. Even if the prefetching algorithm turns out to be highly accurate, an inefficient algorithm cannot be justified for online use.

Adaptivity Thirdly, the grouping algorithm must be adaptive to the workload, because I/O requests do not always arrive in constant and regular patterns over time. Therefore, the grouping algorithm has to adapt well to workload changes, and this adaptivity has to be considered at the design stage.
2.4 Grouping algorithm design
According to our previous discussion in Section 2.2, the First Successor, Last Successor, Noah and Recent Popularity algorithms are all designed for single-successor prediction rather than group prediction, and thus do not fit our purpose. Probability-based Successor Group Prediction appears to be a good candidate. Unfortunately, it is not a practical solution due to its space complexity: it requires unbounded memory to hold the entire online I/O trace in order to calculate the probability of one file being the successor of another.

Since we are working on data grouping, which requires high accuracy, we choose the Recent Popularity algorithm to build the relationship graph. By adjusting the parameters j and k in the best-j-out-of-k algorithm, we can control the accuracy of the prediction algorithm. Once the graph is built, we need to divide the nodes into groups for prefetching. Again, for the purpose of prediction accuracy, we adopt the strictest graph partitioning criterion, the Strongly Connected Components algorithm. Our new algorithm is called AEA, for Accuracy, Efficiency and Adaptivity.
Next, we examine the complexity of the AEA algorithm, a combination of Recent Popularity for building the relationship graph and Strongly Connected Components for graph partitioning.
More formally, the problem addressed by Recent Popularity can be rephrased as follows: given a trace T consisting of a sequence of elements, build a relationship graph G from T using the best-j-out-of-k criterion. To determine the computational complexity, a naive algorithm to construct graph G is given below.
Best-j-Out-Of-k-Graph(T, j, k)
1  Build a set S containing all the unique elements of T
2  Initialize an empty queue Q[Si] of fixed length k for each element Si of S
3  for i ← 1 to |T| − 1
4      do Enqueue T[i + 1] into the proper queue Q[Sh] such that Sh = T[i]
5  G ← ∅
6  for ∀(m, n) such that S[n] appears at least j times in Q[S[m]]
7      do add edge E(m, n) to graph G
8  return G
In this algorithm, suppose the size of T is n and the size of S is m (m ≤ n); we calculate the complexity step by step. For step 1, the time required to go through the entire sequence T is O(n). Step 2 requires O(k × m) time. Step 4 is composed of searching for Sh such that Sh = T[i] (O(m) time) and enqueueing T[i + 1] (O(1) time); since it is repeated n − 1 times, the total time required by steps 3 and 4 is O(m × n). Step 5 obviously requires constant time (O(1)). For step 6, the maximum number of iterations is (m × k)/j, and step 7 requires O(1) time, so the total time required by steps 6 and 7 is O((m × k)/j). Summing up the time required by each step, we get a cumulative time complexity of O(m × n). Hence, we conclude that this is a polynomial time algorithm.
For the computational complexity of the Strongly Connected Components algorithm, a well-known algorithm of O(|V| + |E|) complexity based on depth-first search already exists; the proof can be found in [11]. Hence, this is a linear time algorithm.

In summary, the total cost of building and partitioning the graph is the combined cost of the two aforementioned algorithms; in other words, it is polynomial in time.
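For readers who prefer running code to pseudocode, the following Python sketch implements the two stages of AEA as we understand them: building the best-j-out-of-k relationship graph from an access trace and partitioning it with a strongly connected components pass (Kosaraju's two-pass DFS here; any linear-time SCC algorithm would do). The function names and the toy trace are ours, not taken from the LSFS source.

from collections import defaultdict, deque

def best_j_out_of_k_graph(trace, j, k):
    """Keep the last k observed successors of each element; add an edge a -> b
    when b appears at least j times in a's window (best j-out-of-k)."""
    recent = defaultdict(lambda: deque(maxlen=k))
    for a, b in zip(trace, trace[1:]):
        recent[a].append(b)
    graph = defaultdict(set)
    for a, window in recent.items():
        for b in set(window):
            if list(window).count(b) >= j:
                graph[a].add(b)
    return graph

def strongly_connected_components(graph):
    """Kosaraju's algorithm: DFS finish order on the graph, then DFS on its transpose."""
    nodes = set(graph) | {w for ws in graph.values() for w in ws}
    visited, order = set(), []
    for s in nodes:                               # pass 1: record finish order
        if s in visited:
            continue
        visited.add(s)
        stack = [(s, iter(graph.get(s, ())))]
        while stack:
            v, it = stack[-1]
            for w in it:
                if w not in visited:
                    visited.add(w)
                    stack.append((w, iter(graph.get(w, ()))))
                    break
            else:
                order.append(v)
                stack.pop()
    transpose = defaultdict(set)
    for v, ws in graph.items():
        for w in ws:
            transpose[w].add(v)
    assigned, groups = set(), []
    for s in reversed(order):                     # pass 2: one SCC per DFS tree
        if s in assigned:
            continue
        component, stack = [], [s]
        assigned.add(s)
        while stack:
            v = stack.pop()
            component.append(v)
            for w in transpose.get(v, ()):
                if w not in assigned:
                    assigned.add(w)
                    stack.append(w)
        groups.append(component)
    return groups

# Toy trace: LSFS files A, B, C are repeatedly accessed in sequence, D only once.
trace = ["A", "B", "C"] * 5 + ["D"]
g = best_j_out_of_k_graph(trace, j=2, k=10)
print(strongly_connected_components(g))          # A, B and C fall into one group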
3 LSFS system implementation
LSFS is designed to improve I/O system performance for parallel file systems based on the parallel file server model, such as PVFS2. Therefore, we choose PVFS2 as our development platform. While it might be possible to apply similar ideas to the shared storage model, that topic is outside the scope of this paper.
3.1 Compact Segment I/O of LSFS
A new I/O technique called compact segment I/O is introduced to bridge the mapping gap between the logical file layout and the physical disk layout by enforcing large-only disk I/O operations. A disk segment is the atomic unit that contains a large chunk of data in the segment-structured storage subsystem of LSFS. Future references to any file stored in LSFS result in one or more disk segment accesses, enforcing a large-only I/O style. Compact segment I/O employs the aforementioned AEA algorithm to form groups at runtime. Existing parallel I/O techniques such as data sieving, collective I/O, list I/O, HDF5 and NetCDF optimize I/O performance at the file system level, library level or application level of the parallel storage system software hierarchy. Block-level optimization has been overlooked, even though it appears increasingly important. LSFS implements a compact segment I/O technique that works at both the file system level and the block level to improve performance. In effect, LSFS can be viewed as complementary to the aforementioned higher-level I/O solutions. The highlights of compact segment I/O are listed below.
• Small PVFS2 files accessed by multiple processes are grouped together into segments as the basic access
unit, exploiting the inter-file access locality.
• Hot portions of large PVFS2 files accessed by multiple processes are grouped together into segments,
exploiting the intra-file access locality.
Figure 3(a) and Figure 3(b) illustrate two common scenarios: accessing multiple regions of a large file and accessing multiple small files with compact segment I/O, respectively.
Figure 3: Compact segment I/O examples. (a) Example 1: multiple regions of a large file are dynamically grouped into a segment on disk; (b) Example 2: multiple small files (F1, F2, F3) are dynamically grouped into a segment on disk.
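The following Python sketch shows the packing idea behind these two examples: LSFS "files" (small PVFS2 files or hot regions of a large file) that have been grouped by AEA are packed back to back into a fixed-size segment, so that a reference to any member can be served with one large segment read. The segment size and data structures are illustrative assumptions, not the actual LSFS on-disk format.

from dataclasses import dataclass, field

SEGMENT_SIZE = 4 * 1024 * 1024          # assumed segment size for the example

@dataclass
class Segment:
    payload: bytearray = field(default_factory=bytearray)
    members: dict = field(default_factory=dict)   # lsfs_file_id -> (offset, length)

def pack_group(group, segment_size=SEGMENT_SIZE):
    """Pack (lsfs_file_id, data) pairs from one access group into segments."""
    segments, seg = [], Segment()
    for file_id, data in group:
        if len(seg.payload) + len(data) > segment_size and seg.members:
            segments.append(seg)
            seg = Segment()
        seg.members[file_id] = (len(seg.payload), len(data))
        seg.payload.extend(data)
    if seg.members:
        segments.append(seg)
    return segments

def read_member(segment, file_id):
    # One large read fetches the whole segment; the member is sliced out of the
    # in-memory copy (the group cache), prefetching the rest of the group for free.
    offset, length = segment.members[file_id]
    return bytes(segment.payload[offset:offset + length])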
3.1.1 Lightweight and Custom Designs
Similar to our previous works UCFS [22] and TFS [21], LSFS adopts a lightweight, custom design approach to achieve sustained high performance. The rationale is that tuning to the specifics at hand can increase the performance of applications and parallel file systems. As a lightweight file system, LSFS implements only a subset of local file system functions and omits unnecessary ones for the sake of high performance. In addition, LSFS customizes the design of its file metadata, data layout, and buffer cache prefetching and replacement algorithms to maximize performance. More specifically, LSFS realizes the following lightweight and custom designs:

1. LSFS does not implement file permission checks. It only updates file metadata during file creation and deletion.

2. LSFS manages a flat name space and replaces lengthy name lookups with hashing (a minimal sketch is given after this list).

3. File metadata I/O occurs in memory, and thus most metadata maintenance overhead is eliminated.
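A minimal sketch of the second point, under our own assumptions about the key format (the actual LSFS metadata layout is not described at this level of detail in the paper): each LSFS object is located by hashing a flat key, e.g. a (PVFS2 handle, region offset) pair, into an in-memory table, so no directory tree is traversed and no on-disk metadata is touched on the lookup path.

import hashlib

class FlatNamespace:
    """Flat, hash-based name space: one lookup table, no pathname resolution."""

    def __init__(self):
        self.table = {}      # digest -> segment location (segment id, offset, length)

    @staticmethod
    def key(handle, offset):
        return hashlib.sha1(f"{handle}:{offset}".encode()).digest()

    def insert(self, handle, offset, location):
        self.table[self.key(handle, offset)] = location

    def lookup(self, handle, offset):
        # Expected O(1), replacing per-component directory lookups in a local FS.
        return self.table.get(self.key(handle, offset))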
3.2 Interface between LSFS and PVFS2 Storage Module
LSFS is designed to directly improve read performance rather than write. It filters out read requests from
writes before they reach the local file system. Read requests may be satisfied by in-memory group cache,
LSFS disk, or local file system at LSFS’s discretion; write requests are simply directed to the local file system.
The reduced read traffic contention can in turn indirectly improve write performance and thus overall PVFS2 performance.
LSFS Implementation LSFS can be implemented either at the OS kernel level or in user space, as shown in Figure 4(a) and Figure 4(b), respectively. In the former case (Figure 4(a)), LSFS is composed of two parts: a user-level interface to PVFS2 and a kernel module that does the actual work (file/block access locality identification, cache management, storage management, etc.). The advantages of the kernel implementation are that LSFS could cooperate better with the I/O buffer cache and that it has more information about the underlying hardware, and thus might better tune its internal parameters for optimization. However, this approach has limited portability and little room for customization. As discussed in the previous section, the lightweight design philosophy favors a user-level implementation. We implement LSFS as
a user-level component that is integrated into PVFS2 as a read-only cache, shown in Figure 4(b). Central
to the implementation of LSFS is how it interacts with both the PVFS2 server module and the underlying local file system.

Figure 4: LSFS architecture. (a) LSFS kernel implementation; (b) LSFS user-level implementation.

According to the design of PVFS2, the storage management module is named trove. At
the time of this writing, the only implementation of trove is called DBPF (i.e., DataBase Plus File). We directly modified the DBPF code so that LSFS can take over the requests dispatched to DBPF. Similar to DBPF, LSFS assumes that the underlying local file system complies with the Single UNIX Specification version 3 [4] and thus supports the read/write and/or lio_listio file operations. In effect, the current LSFS version is an extended DBPF module. The software architecture of an LSFS-enhanced PVFS2 is shown in Figure 5 (applications and higher-layer libraries are also shown for clarity).
DBPF adopts two different storage methods, UNIX files and Berkeley DB [15] (version 3 or 4), to store byte streams and key/value pairs (two of the most important storage objects in PVFS2), respectively. It employs a set of file service functions to handle I/O requests, among which the most representative are dbpf_bstream_read_list and dbpf_bstream_write_list; both are the main entry points for read/write operations. LSFS revamps those service functions to intercept requests and, when possible, process them before they reach the local file system. LSFS uses the legacy read/write system calls to interface with the raw disk partition.
For an efficient implementation, LSFS works closely with DBPF and the underlying local file system or raw disk. It monitors the requests received by DBPF and transforms them into its own segment format. If the segment is cached in LSFS, the request is satisfied immediately. Otherwise, LSFS checks whether the segment resides on its raw disk partition (this check is done in memory by searching its own segment lookup table). If the request is neither cached nor resident on disk, it is handed back to DBPF. The source code of our LSFS implementation is available upon request for interested readers.
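The read path described above can be summarized by the following Python sketch. All names here (group_cache, segment_table, raw_disk, dbpf_read) are hypothetical stand-ins for the corresponding LSFS and DBPF components; the point is the order of the checks, not the interfaces.

def lsfs_read(handle, offset, length, group_cache, segment_table, raw_disk, dbpf_read):
    seg_key = (handle, offset)                    # request translated into segment form
    segment = group_cache.get(seg_key)
    if segment is not None:                       # 1. hit in the in-memory group cache
        return segment.extract(offset, length)
    location = segment_table.lookup(seg_key)      # 2. in-memory segment lookup table
    if location is not None:
        segment = raw_disk.read_segment(location) # one large read from the raw partition
        group_cache.put(seg_key, segment)         # the whole group is now prefetched
        return segment.extract(offset, length)
    return dbpf_read(handle, offset, length)      # 3. miss: hand the request back to DBPF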
Figure 5: PVFS2/LSFS software architecture
4 Evaluation Methodology and Results
This section presents the evaluation methodology that we use to quantitatively study the performance of
LSFS as compared to the baseline system.
4.1 Experimental Setup
Our evaluation was performed on two clusters. The first cluster, called CASS, consists of four nodes used for in-house prototyping and performance evaluation. Each node in CASS is equipped with a 2.8 GHz Pentium 4 Xeon processor, 1 GB of RDRAM, and a 200 GB∼400 GB hard drive. The nodes are connected by a Linksys SD2008 8-port Gigabit switch and run the Fedora Core 5 Linux operating system. PVFS2 version 1.4.0 is installed on each node with the same configuration, and each node assumes multiple roles: PVFS2 server, PVFS2 client and compute node. All PVFS2 files were created with the default 64 KByte strip size, summing to a 256 KByte stripe across the four server nodes. In addition, MPICH2 version 1.0.3 is installed as the MPI library.
To further test scalability, we also ran some parallel I/O benchmarks on the Chiba City cluster at Argonne National Laboratory, a Department of Energy laboratory. Chiba City is a 512-CPU cluster primarily running Linux, and it includes a set of eight storage nodes used for our performance testing. Each storage node is an IBM Netfinity 7000s with 500 MHz Xeons, 512 MB of RAM, and 300 GB of disk. The interconnect for high-performance communication is 64-bit Myrinet, and all systems in the cluster are on the Myrinet. The software stack is the same as on CASS; the only difference is that the PVFS2 version installed on this cluster is 1.5.1 rather than 1.4.0, the most up-to-date version of PVFS2 at the time of our testing. We did not make any changes when installing LSFS on the two different PVFS2 versions, which indicates good compatibility of LSFS across PVFS2 versions.
4.2 Baseline System
To fairly evaluate the performance of the LSFS-enhanced PVFS2 prototype, we compare the LSFS solution with one based on the Ext3 file system, the most popular native file system for the Linux-based clusters on which PVFS2 resides. For brevity, in the rest of this paper, PVFS2 with Ext3 support is referred to as PVFS2/Ext3, while PVFS2 with LSFS is referred to as PVFS2/LSFS. During our tests, both PVFS2/Ext3 and PVFS2/LSFS run at the same time and share the same memory space; to be precise, both processes run in the same software and hardware environment, making for a fair comparison. For the same purpose, we create two directories, /mnt/pvfs2 and /mnt/lsfs, to serve as the mount points for PVFS2/Ext3 and PVFS2/LSFS, respectively, so that the files on these file systems are independent of each other. Next we describe our benchmarks in detail and compare the results of both I/O systems.
4.3 Benchmarks and Metrics
To make the evaluation study as comprehensive as possible, we first run two representative parallel file system benchmarks on both PVFS2 server platforms, PVFS2/LSFS and PVFS2/Ext3. Our test suite consists of noncontig and mpi-tile-io, both among the most popular parallel I/O benchmarks to the best of the authors' knowledge [9, 8, 24, 26, 10, 25]. Second, we run another parallel I/O benchmark, IOR [3], to gauge the effectiveness and scalability of the PVFS2/LSFS prototype system on the Chiba City cluster.

For I/O-intensive parallel applications, the metric users are most interested in is the aggregate I/O bandwidth (I/O bandwidth, in brief, for the rest of the paper) of the entire parallel file system, which is the sum of the I/O bandwidth of all storage nodes. As a result, we choose I/O bandwidth as the major metric in the evaluation. Since LSFS does not directly improve PVFS2 write performance, we pay more attention to reads than writes. On the other hand, we also present the write performance in several cases to show that the current version of LSFS can sometimes improve write performance and at least does not compromise it.
4.3.1 Noncontig Benchmark
The noncontig benchmark is a publicly available parallel I/O benchmark from the Parallel I/O Benchmarking Consortium [1]. It is designed for studying I/O performance using various I/O methods, I/O characteristics and
noncontiguous I/O cases. This benchmark is capable of testing three I/O characteristics (region size, region
count and region spacing) against two I/O methods (list I/O, collective I/O) in four I/O access cases (contiguous memory contiguous disk, noncontiguous memory contiguous disk, contiguous memory noncontiguous
disk, and noncontiguous memory noncontiguous disk).
Since our focus is on disk accesses, whether the memory region is contiguous or not is of less interest. For simplicity, we only consider the contiguous-memory cases. As far as file access contiguity is concerned, we study both contiguous and non-contiguous file accesses. Our conjecture is that, for non-contiguous file accesses, PVFS2 could significantly benefit from LSFS, which is designed to improve non-contiguous access performance at the physical disk level; for contiguous file accesses, the corresponding disk accesses may still become non-contiguous because of the gap between the file system and the disk, and therefore PVFS2 can still benefit considerably from the group access feature of LSFS.
At the same time, we notice that the results for the collective I/O method and the independent I/O method also exhibit some differences. We therefore present both of them in Figure 6 to allow a side-by-side comparison. However, the impact of collective I/O on the storage system is beyond the scope of this study.
Figure 6: noncontig I/O performance comparison (I/O bandwidth in MB/s for collective read, collective write, independent read and independent write on PVFS2/Ext3 and PVFS2/LSFS)
In Figure 6, PVFS2/LSFS exhibits a read performance gain of 92% to 230% over the PVFS2/Ext3 baseline system. These results suggest that, by combining highly related accesses into groups, LSFS can boost I/O performance dramatically. The simple strided pattern of the noncontig benchmark issues I/O accesses to the same data repeatedly, which translates into strong group access locality that LSFS is able to take full advantage of.
To better understand the source of the benefit, group access locality, we further investigate the behavior of the LSFS application-level buffer cache, the group cache. In our experiments, we add a small fragment of debug code to LSFS to collect group cache utilization statistics on the fly while the PVFS2/LSFS server is running. The results are then written into a server log file. After several runs of the noncontig benchmark, we extract the following results from the server log file, shown in Table 1.
Number of processes    Hits     Misses   Total    Hit rate (%)
2                      40515    64       40579    99.84
4                      40802    67       40869    99.83
8                      41932    75       42007    99.82
16                     42093    82       42175    99.80

Table 1: Group access locality analysis
Through many rounds of experiments during our testing, we found that the most suitable values of j and k are 2 and 10, respectively, to achieve the best balance between prediction accuracy and computational complexity. The results in Table 1 are obtained based on this observation.

The reason why our group cache works extremely well is that the parameter veclen (the vector length, which determines the region count as well as the region spacing of the I/O in the benchmark) fits well with the LSFS configuration parameters. In view of this, we also studied the sensitivity of PVFS2/LSFS to the vector length parameter of the noncontig benchmark. The corresponding results are shown in Figure 7.
length parameter in noncontig benchmark. The corresponding results are shown in Figure 7.
250
PVFS2/LSFS Read
PVFS2/Ext3 Read
Bandwidth (MB/s)
200
150
100
50
0
4
8
16
32
64
Vector Length
128
256
512
Figure 7: noncontig read performance with different vector lengths
In Figure 7, both PVFS2/LSFS and PVFS2/Ext3 show some sensitivity to the variation of vector length. When the vector length is relatively small (4 to 8), the spacing between requests is not wide enough to defeat the internal data sieving of MPICH2, and the resulting difference between PVFS2/LSFS and PVFS2/Ext3 is negligible. However, when the vector length reaches 16, PVFS2/LSFS obtains a visible improvement in read bandwidth over PVFS2/Ext3, indicating that data sieving becomes ineffective at this point. When the vector length increases further, exceeding the cache capacity of both systems (they reside on the same set of nodes and thus share the same amount of memory and other hardware resources), the performance of PVFS2/Ext3 drops more significantly than that of PVFS2/LSFS, an interesting indication of the ineffectiveness of local file system prefetching compared with LSFS's locality-grouping-based prefetching.
4.3.2 Disk Fragmentation Study
One important motivation for LSFS is to solve a problem common in parallel file systems: a large contiguous file I/O being split into multiple small noncontiguous disk I/Os. By employing the compact segment I/O technique and a lightweight, custom design methodology, LSFS aims to help PVFS2 sustain high local file I/O performance even when disks become quite fragmented.

To verify our conjecture, we manually made the disks fragmented to different degrees and redid our experiments with the noncontig benchmark. Since higher disk utilization suggests a higher level of fragmentation for newly created files, the experiments were carried out repeatedly with the PVFS2 disks 20%, 40%, 60%, 80%, 90% and 95% full. To make a disk reach a targeted space utilization and fragmentation rate, we create a large number of small garbage files and then randomly delete some of them.¹ This emulates certain empirical scientific computing environments; for example, high energy physics applications often have small files created and deleted within a short time period. It is anticipated that a nearly full disk is more fragmented than an ample one, and that the disk space utilization rate is proportional to its fragmentation rate. We use the command df to monitor real-time disk utilization rates. The results for
contiguous file accesses and non-contiguous file accesses are shown in Figure 8(a) and Figure 8(b) respectively.
In both cases, when the disk utilization reaches 90%, the performance degradation for PVFS2/Ext3 becomes much more severe than that of PVFS2/LSFS. Again, this confirms our conjecture that LSFS retains its high performance even if the files on the local file system are fragmented, because disk fragmentation is less likely to occur on LSFS disks. In contrast, for a local file system with a high level of disk fragmentation, even large contiguous file I/O is split into multiple small noncontiguous disk I/Os, which seriously degrades the local file I/O performance of PVFS2.

Figure 8: Disk Fragmentation Impact on File Read Performance. (a) Contiguous file read performance on fragmented disks; (b) non-contiguous file read performance on fragmented disks (read bandwidth in MB/s for PVFS2/LSFS and PVFS2/Ext3 at 20% to 95% disk space utilization).

¹ First, we recursively copy a number of large directories onto those file systems until the partitions are completely full. Second, we delete files according to their size, smallest first, until the desired disk utilizations are achieved. As a result, the large files created afterwards are most likely fragmented.
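The procedure in footnote 1 can be sketched roughly as follows; the paths, the copy source, and the use of shutil are our own assumptions, and on a real testbed the copy loop simply runs until the partition reports "no space left on device".

import os, shutil

def utilization(mount_point):
    usage = shutil.disk_usage(mount_point)
    return usage.used / usage.total

def fragment_disk(mount_point, source_dir, target=0.90):
    # Phase 1: copy large directory trees until the partition is completely full.
    i = 0
    while True:
        try:
            shutil.copytree(source_dir, os.path.join(mount_point, f"garbage_{i}"))
            i += 1
        except OSError:                  # ENOSPC: stop filling
            break
    # Phase 2: delete files smallest-first until the target utilization is reached,
    # leaving scattered free extents behind.
    paths = []
    for root, _, names in os.walk(mount_point):
        paths.extend(os.path.join(root, n) for n in names)
    for path in sorted(paths, key=os.path.getsize):
        if utilization(mount_point) <= target:
            break
        os.remove(path)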
Another observation is that Figure 8(a) presents the read performance results for requests that are contiguous at the file level, while Figure 8(b) presents the same results for non-contiguous ones. Comparing the two figures side by side, the results for contiguous requests always outperform those for non-contiguous ones. The implication is that more severely discrete file-level requests lead to more discrete requests on disk.
4.3.3 Mpi-tile-io Benchmark
The mpi-tile-io benchmark is another synthetic benchmark from the Parallel I/O Benchmarking Consortium benchmark suite [2]. It has been widely used in many parallel I/O related studies [?, 26, 25]. The application implements tiled access to a two-dimensional dataset, with overlapping data between adjacent tiles. The size of the tiles and the overlap ratio are adjustable. Collective I/O support is optional in this application; we studied the cases both with and without collective I/O support in our experiments.
Figure 9 presents the I/O bandwidth results collected when running mpi-tile-io on PVFS2/Ext3 and PVFS2/LSFS, respectively. To show that the performance impact of collective I/O is orthogonal to that of LSFS, we plot the results for both the collective I/O and the independent I/O (non-collective) methods in this figure. In both cases, PVFS2/LSFS exhibits approximately 15% higher I/O bandwidth than PVFS2/Ext3. This derives from the ability of PVFS2/LSFS to group multiple small I/Os into larger ones for read access. A possible reason why the performance gain is not more conspicuous is that mpi-tile-io, as a read-once/write-once dominant application, cannot make repeated use of the grouping feature of LSFS. As explained before, this modest performance gain derives from LSFS's large-only file/disk I/O design. An interesting fact we observed here is that even write performance sees marginal benefits from LSFS. A possible explanation is that, even though we do not explicitly improve writes, they may be improved indirectly because the contention from read traffic is reduced, leaving more disk bandwidth for writes.
Figure 9: mpi-tile-io I/O bandwidth comparison (I/O bandwidth in MB/s for collective read, collective write, independent read and independent write on PVFS2/Ext3 and PVFS2/LSFS)
4.3.4 IOR Benchmark
The IOR software was developed at Lawrence Livermore National Laboratory [3]. It is designed for benchmarking parallel file systems using the POSIX, MPI-IO, or HDF5 interfaces. To test the scalability of our LSFS design, we run the IOR benchmark on the Chiba City cluster, varying the number of clients from 8 to 256.
From the results shown in Figure 10, one can observe that both PVFS2/LSFS and PVFS2/Ext3 show a performance gain approximately proportional to the number of clients (i.e., processes) within a certain range (fewer than 128). The I/O bandwidth of both systems degrades when the number of clients reaches 256 because the PVFS2 servers are saturated. On the other hand, PVFS2/LSFS outperforms PVFS2/Ext3 by up to 132% in terms of read performance (measured as I/O bandwidth). Even when PVFS2 becomes saturated, namely when the number of clients reaches 256, PVFS2/LSFS still beats PVFS2/Ext3 by 39%. Another observation is that the benefit of PVFS2/LSFS over PVFS2/Ext3 from group access locality does not decrease as the number of processes issuing read requests increases. This indicates that PVFS2/LSFS achieves good scalability.
Figure 10: IOR benchmark bandwidth comparison (read bandwidth in MB/s for PVFS2/LSFS and PVFS2/Ext3 with 8 to 256 clients)
5 Conclusions
In this paper, we present the design and implementation of the first version of the Lightweight Segment-structured Local File System (LSFS) prototype for a state-of-the-art parallel file system, PVFS2. LSFS employs a novel compact segment I/O technique and a grouping-based prefetching methodology to resolve two limitations of today's parallel file systems: a mapping gap between local file I/O and disk I/O, and a mismatch between general-purpose local file systems and scientific computing applications. Running several parallel I/O benchmarks on different Linux-based cluster testbeds, we collect comprehensive experimental results and
conclude that an LSFS-enhanced PVFS2 prototype system can significantly outperform a Linux-Ext3-based
PVFS2 by up to 230% in terms of I/O bandwidth.
References
[1] http://www-unix.mcs.anl.gov/pio-benchmark/.
[2] http://www-unix.mcs.anl.gov/∼rross/pio-benchmark/.
[3] http://www.llnl.gov/icc/lc/siop/downloads/download.html.
[4] http://www.unix.org/version3/online.html.
[5] http://www.top500.org, Nov 2006.
[6] Amer, A., and Long, D. D. E. Noah: Low-cost file access prediction through pairs. In Proceedings of the
20th IEEE International Performance, Computing and Communications Conference (IPCCC ’01) (April 2001),
IEEE, pp. 27–33.
[7] Amer, A., Long, D. D. E., and Burns, R. C. Group-based management of distributed file caches. In ICDCS
(2002), p. 525.
[8] Ching, A., Choudhary, A., Coloma, K., keng Liao, W., Ross, R., and Gropp, W. Noncontiguous
I/O accesses through MPI-IO. In Proceedings of the Third IEEE/ACM International Symposium on Cluster
Computing and the Grid (Tokyo, Japan, May 2003), IEEE Computer Society Press, pp. 104–111.
[9] Ching, A., Choudhary, A., keng Liao, W., Ross, R., and Gropp, W. Noncontiguous I/O through PVFS.
In Proceedings of IEEE International Conference on Cluster Computing (Chicago, Illinois, USA, July 29 2002).
[10] Ching, A., Choudhary, A., keng Liao, W., Ross, R., and Gropp, W. Efficient structured data access in
parallel file systems. In Proceedings of the IEEE International Conference on Cluster Computing (Hong Kong,
China, Dec. 2003), IEEE Computer Society Press, pp. 326–335.
[11] Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C. Introduction to Algorithms, second ed. MIT Press and McGraw-Hill, 2001, Section 22.5, pp. 552–557.
[12] Lei, H., and Duchamp, D. An analytical approach to file prefetching. In 1997 USENIX Annual Technical Conference (Anaheim, California, USA, Jan. 1997). http://www.cs.columbia.edu/~lei/resume.html#publications.
[13] Madhyastha, T. M., and Reed, D. A. Learning to classify parallel input/output access patterns. IEEE
Transactions on Parallel and Distributed Systems 13, 8 (Aug. 2002), 802–813.
[14] Memik, G., Kandemir, M. T., keng Liao, W., and Choudhary, A. N. Multicollective I/O: A technique
for exploiting inter-file access patterns. TOS 2, 3 (2006), 349–369.
[15] Olson, M. A., Bostic, K., and Seltzer, M. Berkeley DB. In Proceedings of the FREENIX Track (FREENIX99) (Berkeley, CA, June 6–11 1999), USENIX Association, pp. 183–192.
[16] Patterson, R. H., Gibson, G. A., Ginting, E., Stodolsky, D., and Zelenka, J. Informed prefetching and
caching. In High Performance Mass Storage and Parallel I/O: Technologies and Applications. IEEE Computer
Society Press and Wiley, New York, NY, 2001, ch. 16, pp. 224–244.
[17] Shorter, F. Design and analysis of a performance evaluation standard for parallel file systems. Master’s thesis,
Clemson University, 2003.
[18] Skeppstedt, J., and Dubois, M. Compiler controlled prefetching for multiprocessors using low-overhead traps
and prefetch engines. J. Parallel Distrib. Comput 60, 5 (2000), 585–615.
[19] Thakur, R., Gropp, W., and Lusk, E. Data sieving and collective I/O in ROMIO. In Proc. of the 7th
Symposium on the Frontiers of Massively Parallel Computation (Feb. 1999), IEEE, pp. 182–189.
[20] Thakur, R., Gropp, W., and Lusk, E. On implementing MPI-IO portably and with high performance. In
Proceedings of the Sixth Workshop on Input/Output in Parallel and Distributed Systems (May 1999), pp. 23–32.
[21] Wang, J., Cai, H., and Hu, Y. A light-weight, temporary file system for large-scale web servers. In Proceedings
of 12th International Conference on World Wide Web(WWW12) (Budapest, HUNGARY, May 2003).
[22] Wang, J., Min, R., Zhu, Y., and Hu, Y. UCFS - a user-space, high performance, customized file system for
web proxy servers. IEEE Transactions on Computers 51, 9 (Sep 2002), 1056–1073.
[23] Wang, Y., and Kaeli, D. Profile-guided I/O partitioning. In Proceedings of the 2003 International Conference
on Supercomputing (ICS-03) (New York, June 23–26 2003), ACM Press, pp. 252–260.
[24] Wu, J., Wyckoff, P., and Panda, D. Supporting efficient noncontiguous access in PVFS over Infiniband. In
Proceedings of the IEEE International Conference on Cluster Computing (Hong Kong, China, Dec. 2003), IEEE
Computer Society Press, pp. 344–351.
[25] Wu, J., Wyckoff, P., and Panda, D. PVFS over InfiniBand: Design and performance evaluation. In Proceedings of the 2003 International Conference on Parallel Processing (32nd ICPP'03) (Kaohsiung, Taiwan, Oct. 2003), IEEE Computer Society, pp. 125–132.
[26] Yu, W., Liang, S., and Panda, D. K. High performance support of parallel virtual file system (PVFS2) over
quadrics. In ICS (2005), Arvind and L. Rudolph, Eds., ACM, pp. 323–331.