
Distributed Metadata Management for Parallel Filesystems
A Thesis
Presented in Partial Fulfillment of the Requirements for the Degree
Master of Science in the Graduate School of The Ohio State
University
By
Vilobh Meshram, B.Tech. (Computer Science)
Graduate Program in Computer Science and Engineering
The Ohio State University
2011
Master’s Examination Committee:
Dr. D.K. Panda, Advisor
Dr. P. Sadayappan
© Copyright by
Vilobh Meshram
2011
Abstract
Much of the research in storage systems has focused on improving the scale and performance of data access for applications that read and write large amounts of file data. Parallel filesystems do a good job of scaling large-file access bandwidth by striping or sharing I/O resources across many servers or disks. However, the same cannot be said about scaling file metadata operation rates.
Most existing parallel filesystems concentrate all metadata processing load on a single server. This centralized processing can guarantee correctness, but it severely hampers scalability. This downside is becoming more and more unacceptable as metadata throughput is critical for large-scale applications. Distributing the metadata processing load is critical to improving metadata scalability when handling a huge number of client nodes. However, in such a distributed scenario, a solution to speed up metadata operations has to address two challenges simultaneously: scalability and reliability.
We propose two approaches to address these challenges for metadata management in parallel filesystems, with a focus on reliability and scalability. As demonstrated by our experiments, our approach to distributed metadata management achieves significant improvements over native parallel filesystems for all major metadata operations. With 256 client processes, it outperforms Lustre and PVFS2 by factors of 1.9 and 23, respectively, for directory creation. For the stat() operation on files, our approach is 1.3 and 3.0 times faster than Lustre and PVFS2, respectively.
This work is dedicated to my parents and my sister
Acknowledgments
I consider myself extremely fortunate to have met and worked with some remarkable people during my stay at Ohio State. While a brief note of thanks does not do
justice to their impact on my life, I deeply appreciate their contributions.
I begin by thanking my adviser, Dr. Dhabaleswar K. Panda. His guidance and advice during the course of my Master's studies have shaped my career. I am thankful to Dr. P. Sadayappan for agreeing to serve on my Master's examination committee.
Special thanks to Xiangyong Ouyang for all the support and help. I would also like to thank Dr. Xavier Besseron for his insightful comments and discussions, which helped me strengthen my thesis. I am especially grateful to Xiangyong, Xavier and Raghu, and I feel lucky to have collaborated closely with them. I would like to
thank all my friends in the Network Based Computing Research Laboratory for their
friendship and support.
Finally, I thank my family, especially my parents and my sister. Their love, action,
and faith have been a constant source of strength for me. None of this would have
been possible without them.
Vita
April 18, 1986: Born, Amravati, India
2007: B.Tech., Computer Science, COEP, Pune University, Pune, India
2007-2009: Software Development Engineer, Symantec R&D India
2010-2011: Graduate Research Associate, The Ohio State University
Publications
Research Publications
Vilobh Meshram, Xavier Besseron, Xiangyong Ouyang, Raghunath Rajachandrasekar and Dhabaleswar K. Panda. "Can a Decentralized Metadata Service Layer Benefit Parallel Filesystems?" Accepted at the IASDS 2011 workshop, held in conjunction with Cluster 2011.
Vilobh Meshram, Xiangyong Ouyang and Dhabaleswar K. Panda. "Minimizing Lookup RPCs in Lustre File System using Metadata Delegation at Client Side." OSU Technical Report OSU-CISRC-7/11-TR20, July 2011.
Raghunath Rajachandrasekar, Xiangyong Ouyang, Xavier Besseron, Vilobh Meshram and Dhabaleswar K. Panda. "Can Checkpoint/Restart Mechanisms Benefit from Hierarchical Data Staging?" To appear at the Resilience 2011 workshop, held in conjunction with Euro-Par 2011.
Fields of Study
Major Field: Computer Science and Engineering
Studies in High Performance Computing: Prof. D. K. Panda
Table of Contents

Abstract
Dedication
Acknowledgments
Vita
List of Tables
List of Figures

1. Introduction
   1.1 Parallel Filesystems
   1.2 Metadata Management in Parallel Filesystems
   1.3 Distributed Coordination Service
   1.4 Motivation of the Work
       1.4.1 Metadata Server Bottlenecks
       1.4.2 Consistency management of Metadata
   1.5 Problem Statement
   1.6 Organization of Thesis

2. Related Work
   2.1 Metadata Management approaches
   2.2 Scalable filesystem directories

3. Delegating metadata at client side (DMCS)
   3.1 RPC Processing in Lustre Filesystem
   3.2 Existing Design
   3.3 Design and challenges for delegating metadata at client side
       3.3.1 Design of communication module
       3.3.2 Design of DMCS approach
       3.3.3 Challenges
       3.3.4 Metadata revocation
       3.3.5 Distributed Lock management for DMCS approach
   3.4 Performance Evaluation
       3.4.1 File Open IOPS: Varying Number of Client Processes
       3.4.2 File Open IOPS: Varying File Pool Size
       3.4.3 File Open IOPS: Varying File Path Depth
   3.5 Summary

4. Design of a Decentralized Metadata Service Layer for Distributed Metadata Management
   4.1 Detailed design of Distributed Union FileSystem (DUFS)
       4.1.1 Implementation Overview
       4.1.2 FUSE-based Filesystem Interface
   4.2 ZooKeeper-based Metadata Management
       4.2.1 File Identifier
       4.2.2 Deterministic mapping function
       4.2.3 Back-end storage
   4.3 Algorithm examples for Metadata operations
       4.3.1 Reliability concerns
   4.4 Performance Evaluation
       4.4.1 Distributed coordination service throughput and memory usage experiments
       4.4.2 Scalability Experiments
       4.4.3 Experiments with varying number of distributed coordination service servers
       4.4.4 Experiment with different number of mounts combined using DUFS
       4.4.5 Experiments with different back-end parallel filesystems
   4.5 Summary

5. Contributions and Future Work
   5.1 Summary of Research Contributions and Future Work
       5.1.1 Delegating metadata at client side
       5.1.2 Design of a decentralized metadata service layer for distributed metadata management

Bibliography
List of Tables

1.1 LDLM and Oprofile Experiments
1.2 Transaction throughput with a fixed file pool size of 1,000 files
1.3 Transaction throughput with varying file pool
1.4 Transaction throughput with a fixed file pool size of 5,000 files
3.1 Metadata operation rates with different underlying storage
List of Figures

1.1 Basic Lustre Design
1.2 Zookeeper Design
1.3 Example of consistency issue with 2 clients and 2 MetaData servers
3.1 Design of DMCS approach
3.2 File open IOPS, Each Process Accesses 10,000 Files
3.3 File open IOPS, Using 16 Client Processes
3.4 Time to Finish open, Using 16 Processes Each Accessing 10,000 Files
4.1 DUFS mapping from the virtual path to the physical path using File Identifier (FID)
4.2 DUFS overview. A, B, C and D show the steps required to perform an open() operation.
4.3 Sample physical filename generated from a given FID
4.4 Algorithm for the mkdir() operation
4.5 Algorithm for the stat() operation
4.6 ZooKeeper throughput for basic operations by varying the number of ZooKeeper Servers
4.7 Zookeeper memory usage and its comparison with DUFS and basic FUSE based file system memory usage
4.8 Scalability experiments with 8 Client nodes and varying number of client processes
4.9 Scalability experiments with 16 Client nodes and varying number of client processes
4.10 Operation throughput by varying the number of Zookeeper Servers
4.11 File operation throughput for different numbers of back-end storage
4.12 Operation throughput with respect to the number of clients for Lustre and PVFS2
Chapter 1: INTRODUCTION
High-performance computing (HPC) is an integral part of today’s scientific, economic, social, and commercial fabric. We depend on HPC systems and applications
for a wide range of activities such as climate modeling, drug research, weather forecasting, and energy exploration. HPC systems enable researchers and scientists to
discover the origins of the universe, design automobiles and airplanes, predict weather
patterns, model global trade, and develop life-saving drugs. Because of the nature of
the problems that they are trying to solve, HPC applications are often data-intensive.
Scientific applications in astrophysics (CHIMERA and VULCAN2D), climate modeling (POP), combustion (S3D), fusion (GTC), visualization, astronomy, and other
fields generate or consume large volumes of data. This data is on the order of terabytes and petabytes and is often shared by the entire scientific community. Today's computational requirements are increasing at a geometric rate and involve ever larger quantities of data. While the computational power of microprocessors has kept pace with Moore's law as a result of increased chip densities, performance improvements in magnetic storage have not seen a corresponding increase. The result has been an increasing gap between the computational power and the I/O subsystem performance of current HPC systems. Hence, while supercomputers keep getting faster, we do not see a corresponding improvement in application performance, because of the I/O bandwidth bottleneck.
Parallel filesystems do a good job of improving the data throughput rate by striping or sharing I/O resources across many servers and disks. The same cannot be said about metadata operations. Every time a file is opened, saved, closed, searched, backed up or replicated, some portion of metadata is accessed. As a result, metadata operations fall on the critical path of a broad spectrum of applications. Studies [20,23] show that over 75% of all filesystem calls require access to file metadata. Therefore, efficient management of metadata is crucial for overall system performance.
Even though modern distributed filesystem architectures like Lustre [4], PVFS [10] and the Google File System [13] separate the management of metadata from the storage of the actual file data, the entire namespace is still managed by a centralized metadata server. These architectures have proven able to easily scale storage capacity and bandwidth. However, the management of metadata remains a bottleneck.
Recent trends in high-performance computing have also seen a shift toward distributed resource management. Scientific applications are increasingly accessing data stored in remote locations. This trend is a marked deviation from the earlier norm of co-locating an application and its data. In such a distributed environment, the management of metadata becomes even more difficult, since reliability, consistency and scalability all need to be taken care of. Because, as noted above, a single metadata server manages the entire namespace in most parallel filesystems, new approaches need to be designed for distributed metadata management. A few parallel filesystems have designs for better metadata management that aim to overcome the single point of metadata bottleneck, but given the complexity of distributed metadata management, this effort is still in progress.
Our research focuses on addressing these two problems. We have examined the existing paradigms and suggest better alternatives. In the first part, we focus on an approach for the Lustre filesystem to overcome the problem of a single point of bottleneck. In the second part, we design and evaluate our scheme for distributed metadata management in parallel filesystems, with the primary aim of improving the scalability of the filesystem while maintaining its reliability and consistency.
1.1 Parallel Filesystems
Parallel filesystems are mostly used in high-performance computing environments, which deal with or generate massive amounts of data. Parallel filesystems usually separate the processing of metadata from data. Some parallel filesystems, e.g., Lustre, have a separate metadata server to handle metadata operations, whereas others, e.g., PVFS, may keep the metadata and data in the same place. Let's consider the case of Lustre. Lustre is a POSIX-compliant, open-source distributed parallel filesystem. Due to its extremely scalable architecture, Lustre deployments are popular in scientific supercomputing, as well as in the oil and gas, manufacturing, rich media, and finance sectors. Lustre presents a POSIX interface to its clients with parallel access capabilities to shared file objects. Lustre is an object-based filesystem composed of three components: a metadata server (MDS), object storage servers (OSSs), and clients. Figure 1.1 illustrates the Lustre architecture. Lustre uses block devices for file data and metadata storage, and each block device can be managed by only one Lustre service. The total data capacity of the Lustre filesystem is the sum of all individual OST capacities.
Lustre clients access and concurrently use data through the standard POSIX I/O system calls. The MDS provides metadata services; correspondingly, an MDC (metadata client) is a client of those services. One MDS per filesystem manages one metadata target (MDT). Each MDT stores file metadata, such as file names, directory structures, and access permissions. An OSS (object storage server) exposes block devices and serves data; correspondingly, an OSC (object storage client) is a client of those services. Each OSS manages one or more object storage targets (OSTs), and OSTs store file data objects.
[Figure: clients interact with the Meta-Data Server (MDS) for directory operations, metadata and concurrency control, and with the Object Storage Targets (OSTs) for file I/O and locking; an LDAP server provides configuration information, network connection details and security management; the MDS coordinates recovery, file status and file creation with the OSTs.]

Figure 1.1: Basic Lustre Design
1.2 Metadata Management in Parallel Filesystems
Parallel filesystems like Lustre and the Google Filesystem differ from classical distributed filesystems like NFS in that they separate the management of metadata from the actual file data. In classical distributed filesystems like NFS, the server has to manage both data and metadata. This increases the load on the server and limits the performance and scalability of the filesystem. Parallel filesystems store the metadata on a separate server known as the metadata server (MDS). Let's consider the example of the Lustre filesystem. In terms of on-disk storage of metadata, the parallel filesystem keeps additional information known as Extended Attributes (EA) apart from the normal file metadata attributes such as the inode. The EA information, along with the normal file attributes, is handed over to the client during a getattr or lookup operation. So when the client wants to perform actual I/O, it knows which servers to talk to and how the file is striped among the servers. From the MDS point of view, each file is composed of multiple data objects striped over one or more OSTs. A file object's layout information is defined in the extended attribute (EA) of the inode. Essentially, the EA describes the mapping between file object ids and their corresponding OSTs. This information is also known as the striping EA.
For example, if the stripe size is 1MB, then byte ranges [0,1M) and [4M,5M) might be stored as object x on OST p; [1M,2M) and [5M,6M) as object y on OST q; and [2M,3M) and [6M,7M) as object z on OST r. Before reading the file, a client will query the MDS via the MDC and be informed that it should talk to OST p, OST q and OST r for this operation. This information is structured in the so-called LSM, and the client-side LOV (logical object volume) interprets it so that the client can send requests to the OSTs. Here again, the client communicates with an OST through a client module interface known as the OSC. Depending on the context, OSC can also be used to refer to an OSS client by itself. All client/server communications in Lustre are coded as RPC requests and responses. Within the Lustre source, this middle layer is known as Portal RPC, or ptl-rpc, which translates filesystem requests to and from the equivalent RPC requests and responses, with the LNET module finally putting them down onto the wire.
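To make the striping arithmetic concrete, the following is a minimal Python sketch (illustrative only, not Lustre code; the object list stands in for the striping EA) of how a file byte offset resolves to a stripe object and an object-local offset under round-robin striping:

    # Round-robin striping arithmetic (illustrative sketch, not Lustre source).
    STRIPE_SIZE = 1 << 20                     # 1 MB stripes, as in the example
    OBJECTS = ["obj x / OST p", "obj y / OST q", "obj z / OST r"]

    def locate(offset, stripe_size=STRIPE_SIZE, objects=OBJECTS):
        """Map a file byte offset to (object, offset within that object)."""
        stripe = offset // stripe_size                 # global stripe index
        obj = objects[stripe % len(objects)]           # round-robin placement
        # Complete rounds already on this object, plus position inside stripe:
        local = (stripe // len(objects)) * stripe_size + offset % stripe_size
        return obj, local

    # With three objects, [3M, 4M) wraps back to object x as its second stripe:
    assert locate(3 * STRIPE_SIZE) == ("obj x / OST p", STRIPE_SIZE)

The MDS never evaluates these offsets itself; clients compute this mapping locally from the striping EA, which is what lets the data path bypass the metadata server entirely.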
Most parallel filesystems follow this kind of architecture, in which a single metadata server manages the entire namespace. When the load on the MDS increases, the performance of the MDS degrades, which in turn slows down the entire filesystem. The MDS consists of many important components, such as the Lustre Distributed Lock Manager (LDLM), which occupies a major chunk of the processing time at the MDS. We performed experiments using the Oprofile tool to profile the Lustre code and understand the amount of time consumed by the LDLM module. The experiment was performed on 8 client nodes. Table 1.1 shows the amount of time consumed by the Lock Manager module at the MDS. In this kind of environment, where a single metadata server manages the entire namespace, most of the time is spent in the LDLM module and in communication. By communication we mean sending a blocking AST to the client holding a valid copy and then invalidating the local cache at that client. Also, allowing only a single metadata target (MDT) in a filesystem means that Lustre metadata operations can be processed only as quickly as a single server and its backing filesystem can manage. In order to improve the performance and scalability of parallel filesystems, effort has been made in the direction of distributed metadata management.
Table 1.1: LDLM and Oprofile Experiments

File                     Percentage
ldlm/ldlm_lockd.c        0.0044
ldlm/ldlm_inodebits.c    0.0044
ldlm/ldlm_internal.h     0.7104
ldlm/ldlm_lib.c          1.4341
ldlm/ldlm_lock.c         0.0132
ldlm/ldlm_pool.c         18.5729
ldlm/ldlm_request.c      1.8754
ldlm/ldlm_resource.c     5.3526
Clustered Metadata Server (CMD) is an approach proposed by the Lustre community for distributed metadata management. With CMD functionality, multiple MDSs jointly provide a single filesystem namespace, storing the directory and file metadata on a set of MDTs. Clustered metadata means there are multiple active MDS servers in one Lustre filesystem, so the MDS workload can be shared among several servers and metadata performance significantly improved. Although CMD would improve the performance and scalability of Lustre, it also brings some difficulties, the most complex of which are recovery, consistency and reliability. In CMD, one metadata operation may need to update several different MDSs. To maintain the consistency of the filesystem, the update must be atomic: if the update on one MDS fails, all other updates must be rolled back to their original states. To handle this, CMD uses a global lock. But a global lock slows down the overall throughput of the filesystem.
1.3 Distributed Coordination Service
Google's Chubby [9] is a distributed lock service that has gained wide adoption within Google's data centers. The Chubby lock service is intended to provide coarse-grained locking as well as reliable storage for a loosely coupled distributed system. The purpose of the lock service is to allow its clients to synchronize their activities and to agree on basic information about their environment. The primary goals include reliability, availability to a moderately large set of clients, and easy-to-understand semantics; throughput and storage capacity are considered secondary. Chubby's client interface is similar to that of a simple file system that performs whole-file reads and writes, augmented with advisory locks and with notification of various events such as file modification. Chubby helps developers deal with coarse-grained synchronization within their systems, and in particular with the problem of electing a leader from among a set of otherwise equivalent servers. For example, the Google File System [13] uses a Chubby lock to appoint a GFS master server, and Bigtable [11] uses Chubby in several ways: to elect a master, to allow the master to discover the servers it controls, and to permit clients to find the master. In addition, both GFS and Bigtable use Chubby as a well-known and available location to store a small amount of metadata; in effect, they use Chubby as the root of their distributed data structures. The primary purpose of storing the root in Chubby is improved reliability and consistency: even in the event of a node failure, we are still able to view the contents of the directory due to the reliability provided by Chubby.

Apache ZooKeeper, not surprisingly, is a close clone of Chubby, designed to fill many of the same roles for HDFS and other Hadoop infrastructure. ZooKeeper [14] is a distributed, open-source coordination service for distributed applications. It exposes a simple set of interfaces that distributed applications can build upon to implement higher-level services for synchronization, configuration maintenance and naming. ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical namespace, organized similarly to a standard file system. The namespace consists of special nodes known as znodes. Znodes are not meant to hold bulk file data; they store small amounts of coordination and configuration information. The ZooKeeper implementation puts a premium on high-performance, highly available, strictly ordered access. The strict ordering means that sophisticated synchronization primitives can be implemented at the client. ZooKeeper is replicated over a set of hosts. ZooKeeper performs better under read-intensive workloads than under write/update-intensive workloads [14].
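As a hedged illustration of the znode interface, here is a minimal sketch using the third-party kazoo Python client (the ensemble address and znode paths are assumptions for this sketch; this is not part of the systems described in this thesis):

    from kazoo.client import KazooClient

    # Connect to a ZooKeeper ensemble (address assumed for this sketch).
    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()

    # Znodes form a hierarchical namespace, much like filesystem paths.
    zk.ensure_path("/config/workers")

    # A znode holds a small payload; ephemeral znodes disappear when the
    # creating session dies, the building block for membership and elections.
    zk.create("/config/workers/w1", b"host=node01", ephemeral=True)

    data, stat = zk.get("/config/workers/w1")       # payload + version info
    children = zk.get_children("/config/workers")   # list the group members
    zk.stop()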
Figure 1.2: Zookeeper Design
1.4 Motivation of the Work
Parallel file systems can easily scale bandwidth and improve performance by operating on data in parallel using strategies such as data striping, sharing resources, etc.
However, most parallel file systems do not provide the ability to scale and parallelize
metadata operations as it is inherently more complex than scaling the performance
of data operations [6]. PVFS provides some level of parallelism through distributed
metadata servers that manage different ranges of metadata. The Lustre community
has also proposed the idea of Clustered Metadata Server (CMD) to minimize the load
on a single Metadata Server, wherein multiple metadata servers share the metadata
processing workload.
1.4.1 Metadata Server Bottlenecks
The MDS is currently restricted to a single node, with a fail-over MDS that becomes operational if the primary server becomes nonfunctional. Only one MDS is ever operational at a given time. This limitation poses a potential bottleneck as the number of clients and/or files increases. IOZone [2] is used to measure sequential file I/O throughput, and Postmark [5] is used to measure the scalability of MDS performance. Since MDS performance is the primary concern of this research, we discuss the Postmark experiment in more detail. Postmark is a filesystem benchmark that performs many metadata-intensive operations to measure MDS performance. Postmark first creates a pool of small files (1KB to 10KB), and then starts many sequential transactions on the file pool. Each transaction performs two operations, either read/append on a file or create/delete of a file. Each of these operations happens with the same probability. The transaction throughput is measured to
approximate workloads on an Internet server. Table 1.2 gives the measured transaction throughput with a fixed file pool size of 1,000 files and different numbers of transactions on this pool. The transaction throughput remains relatively constant as the transaction count varies. Since the cost for the MDS to perform an operation does not change at a fixed file count, this result is expected. Table 1.3, on the other hand, changes the file pool size and measures the corresponding transaction throughput. By comparing the entries in Table 1.3 with their counterparts in Table 1.2, it becomes clear that a larger file pool results in a lower transaction throughput. We also performed experiments varying the number of transactions while keeping the number of files in the pool constant; Table 1.4 shows the details. As seen in Table 1.4, for a constant file pool size and a varying number of transactions, we do not see a large change in the transaction throughput. The MDS caches the most recently accessed metadata of files (the inode of a file). A client file operation requires the metadata information about that file to be returned by the MDS. With a larger number of files in the pool, a client request is less likely to be serviced from the MDS cache. A cache miss results in the MDS looking up its disk storage to load the inode of the requested file, which explains the lower transaction throughput in Table 1.3.
Table 1.2: Transaction throughput with a fixed file pool size of 1,000 files

Number of transactions    Transactions per second
1,000                     333
5,000                     313
10,000                    325
20,000                    321
Table 1.3: Transaction throughput with varying file pool

Number of files in pool    Number of transactions    Transactions per second
1,000                      1,000                     333
5,000                      5,000                     116
10,000                     10,000                    94
20,000                     20,000                    79
Table 1.4: Transaction throughput with a fixed file pool size of 5,000 files

Number of files in pool    Number of transactions    Transactions per second
5,000                      1,000                     333
5,000                      5,000                     316
5,000                      10,000                    318
5,000                      20,000                    313

1.4.2 Consistency management of Metadata
The majority of distributed filesystems use a single metadata server. However, this is a bottleneck that limits the operation throughput. Managing multiple metadata servers brings many difficulties: maintaining consistency between two copies of the same directory hierarchy is not straightforward. We illustrate such a difficulty in Figure 1.3. We have two metadata servers (MDS) and consider two clients that perform an operation on the same directory at the same time. Client 1 creates the directory d1 and client 2 renames the directory d1 to d2. As shown in Figure 1.3a, each client performs its operation in the following order: first on MDS1, then on MDS2. From the MDS point of view, there is no guarantee on the execution order of the requests, since they come from different clients.
[Figure, panel (a), on the client side: Client 1 performs 'mkdir d1' first on MDS1 and then on MDS2; Client 2 performs 'mv d1 d2' first on MDS1 and then on MDS2. Panel (b), on the MetaData server side: MDS1 executes 'mkdir d1' from client 1 and then 'mv d1 d2' from client 2, giving result d2; MDS2 executes 'mv d1 d2' from client 2 and then 'mkdir d1' from client 1, giving result d1.]

Figure 1.3: Example of consistency issue with 2 clients and 2 MetaData servers
As shown in Figure 1.3b, the requests can be executed in a different order on each metadata server while still respecting the per-client ordering. In this case, the resulting states of the two metadata servers are not consistent.

This small example highlights that distributed algorithms are required to maintain consistency between multiple metadata servers: each client operation must appear to be atomic and must be applied in the same order on all the metadata servers. For this reason, we decided to use a distributed coordination service, ZooKeeper, in the proposed metadata service layer. Such a coordination service implements the required distributed algorithms in a reliable manner.
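The following toy sketch (illustrative Python; a single in-process sequencer stands in for ZooKeeper's replicated atomic broadcast) shows the property that is needed: once all namespace operations pass through one agreed total order, every MDS that replays the log reaches the same state, avoiding the divergence of Figure 1.3b.

    import threading

    class Sequencer:
        """Toy stand-in for a coordination service: one agreed total order."""
        def __init__(self):
            self._lock = threading.Lock()
            self._log = []                       # the agreed operation log

        def submit(self, op):
            with self._lock:                     # atomic append => total order
                self._log.append(op)
                return len(self._log) - 1        # global sequence number

        def log(self):
            with self._lock:
                return list(self._log)

    def replay(log):
        """Each MDS replays the same log, so all replicas converge."""
        namespace = set()
        for verb, name, new_name in log:
            if verb == "mkdir":
                namespace.add(name)
            elif verb == "mv" and name in namespace:
                namespace.remove(name)
                namespace.add(new_name)
        return namespace

    seq = Sequencer()
    seq.submit(("mkdir", "d1", None))            # from client 1
    seq.submit(("mv", "d1", "d2"))               # from client 2
    # Both metadata servers replay the identical log and agree on {"d2"}:
    assert replay(seq.log()) == {"d2"}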
1.5 Problem Statement
The amount of data generated and consumed by high-performance computing applications is increasing exponentially. Current I/O paradigms and filesystem designs are often overwhelmed by this deluge of data. Parallel filesystems improve I/O throughput to a certain extent by incorporating features such as resource sharing, data striping, etc. Distributed filesystems often dedicate a subset of their servers to metadata management. Filesystems such as NFS [17], AFS [15], Lustre [4], etc., use a single metadata server to manage a globally shared filesystem namespace. While simple, this design does not scale, resulting in the metadata server becoming a bottleneck and a single point of failure.

In this thesis, we study and critique the current metadata management techniques in parallel filesystems, taking the Lustre filesystem as our use case. We propose two new designs for metadata management in parallel filesystems. In the first part, we present a design where we delegate metadata to the client side to solve the problem of a single metadata server (MDS) becoming a bottleneck while managing the entire namespace. We aim to minimize the memory pressure at the MDS by delegating some of the metadata to clients, so as to improve the scalability of Lustre. In the second part, we design a decentralized metadata service layer and evaluate its benefits in a parallel filesystem environment. The decentralized metadata service layer takes care of distributed metadata management with the primary aim of improving the scalability of the filesystem while maintaining its reliability and consistency.

Specifically, our research attempts to answer the following questions:
1. What are the challenges and problems associated with a single server managing
the entire namespace for a parallel file system?
2. How can we minimize the load on a single MDS by distributing metadata to the client side?
3. What are the challenges and problems associated with distributed metadata
management?
4. Can a distributed coordination service be incorporated into parallel filesystems
for distributed metadata management so as to improve the reliability and consistency aspects?
5. How will a decentralized metadata service layer perform with respect to various
metadata operations as compared to the basic variant of parallel filesystems
such as Lustre [4] and PVFS [10]?
6. Will a decentralized metadata service layer designed for distributed metadata
management do a good job in improving the scalability of parallel file system?
Will it help in maintaining the consistency and reliability of the file system?
1.6 Organization of Thesis
The rest of the thesis is organized as follows. Chapter 2 presents an overview of work in the area of parallel filesystems, with a focus on metadata management. Chapter 3 proposes a distributed metadata management technique that delegates metadata at the client side. In Chapter 4, we explore the feasibility of using a distributed coordination service for distributed metadata management. We conclude our work and present future research directions in Chapter 5.
Chapter 2: RELATED WORK
In this chapter, we discuss some of the current literature related to metadata management in high-performance computing environments. We highlight the drawbacks of current metadata management paradigms in parallel filesystems and suggest better designs and algorithms for metadata management in parallel filesystems.
2.1 Metadata Management approaches
File system metadata management has long been an active area of research [15]. With the advent of commodity clusters and parallel filesystems [4], managing metadata efficiently and in a scalable manner poses significant challenges. Distributed filesystems often dedicate a subset of their servers to metadata management. Mapping the semantics of data and metadata across different, non-overlapping servers allows filesystems to scale in terms of I/O performance and storage capacity. Filesystems such as NFS [17], AFS [15], Lustre [4], and GFS [13] use a single metadata server to manage a globally shared filesystem namespace. While simple, this design does not scale, resulting in the metadata server becoming a bottleneck and a single point of failure. Filesystems like NFS [17], Coda [21] and AFS [15] may also partition their namespace statically among multiple servers, so most of the major metadata operations remain centralized. pNFS [12] allows for distributed
data but retains the concept of centralized metadata. Other parallel filesystems like GPFS [22], Intermezzo [7] and Lustre [4] use directory locks for file creation, with the help of distributed lock management (DLM) for better performance. Lustre uses a single metadata server to manage the entire namespace. Lustre's distributed lock management module handles locks between clients and servers and local locks between the nodes. The Lustre community has also acknowledged that a single metadata server is a bottleneck in HPC environments, which led to the concept of the Lustre Clustered Metadata Server (CMD). CMD is still a prototype and there is no implementation of it to date; the original design for CMD was proposed in 2008. In CMD, files are identified by a global FID and are assigned to a metadata server; once we know the FID, we can deal with that server directly. Getting this FID still requires a centralized master metadata server, and this information is not redundant, so there will still be a bottleneck at the master node in CMD. The reliability and availability of CMD also depend heavily on the master node. To mitigate the problems associated with a central metadata server, AFS [15] and NFS [17] employ static directory subtree partitioning [24] to partition the namespace across multiple metadata servers. Each server is delegated the responsibility of managing the metadata associated with a subtree. Hashing [8] is another technique used to partition the filesystem namespace. It uses a hash of the file name to assign metadata to the corresponding MDS. Hashing diminishes the problem of hot spots that is often experienced with directory subtree partitioning. The Lazy Hybrid metadata management scheme [8,23] combines hierarchical directory management and hashing with lazy updates. Zhu et al. proposed using Hierarchical Bloom Filter Arrays [25] to map file names to the corresponding metadata servers. They
used two levels of Bloom Filter Arrays with differing degrees of accuracy and memory overhead to distribute the metadata management responsibilities across multiple
servers. Ananth et al. explored multiple algorithms for creating files on a distributed
metadata file system for scalable metadata performance.
In the past, in order to get more metadata mutation throughput, efforts aimed to mount more independent filesystems into a larger aggregate, but each directory or directory subtree is still managed by one metadata server. Some systems cluster metadata servers in pairs for fail-over, but this does not increase throughput. Some systems allow any server to act as a proxy and forward requests to the appropriate server, but this also does not increase metadata mutation throughput in a directory [3]. Symmetric shared-disk filesystems that support concurrent updates to the same directory use complex distributed locking and cache consistency semantics, both of which have significant bottlenecks for concurrent create workloads, especially from many clients working in one directory. Moreover, filesystems that support client caching of directory entries for faster read-only workloads generally disable client caching during concurrent update workloads to avoid excessive consistency overhead. A recent trend among distributed filesystems is to use the concept of objects to store data and metadata. CRUSH [23] is a data distribution algorithm that maps object replicas across a heterogeneous storage system. It uses a pseudo-random function to map data objects to storage devices. Lustre, PanFS and Ceph [23] use various non-standard object interfaces requiring the use of dedicated I/O and metadata servers. Instead, our work breaks away from the dedicated-server paradigm and redesigns parallel filesystems to use standards-compliant OSDs for data and metadata storage. There has also been work in the area of combining multiple partitions into a virtual mount point. UnionFS (the Linux union filesystem in the kernel mainline) [18] has many options, but it does not support load balancing between branches. Most filesystems that combine multiple partitions into a virtual mount work on a single node to combine local partitions or directories. Also, some union filesystems cannot extract parallelism: their default behavior is to use the first partition until it reaches a threshold (based on free space). Such a filesystem cannot attain higher throughput even after combining multiple mount points, being restricted by the throughput of the first mounted partition.
2.2 Scalable filesystem directories
GPFS is a shared-disk filesystem that uses a distributed implementation of Fagin's extendible hashing for its directories. Fagin's extendible hashing dynamically doubles the size of the hash table, pointing pairs of links to the original bucket and expanding only the overflowing bucket (by restricting implementations to a specific family of hash functions). It has a two-level hierarchy: buckets (to store the directory entries) and a table of pointers (to the buckets). GPFS represents each bucket as a disk block and the pointer table as the block pointers in the directory's i-node. When the directory grows in size, GPFS allocates new blocks, moves some of the directory entries from the overgrown block into the new block, and updates the block pointers in the i-node. GPFS employs its client cache consistency and distributed locking mechanism to enable concurrent accesses to a shared directory. Concurrent readers can cache the directory blocks using shared reader locks, which enables high performance for read-intensive workloads. Concurrent writers, however, need to acquire write locks from the lock manager before updating the directory blocks stored on the shared disk storage. When releasing (or acquiring) locks, GPFS versions before 3.2.1 force the directory block to be flushed to disk (or read back from disk), inducing high I/O overhead. Newer releases of GPFS have modified the cache consistency protocol to send directory insert requests directly to the current lock holder, instead of getting the block through the shared disk subsystem [22]. Still, GPFS continues to synchronously write the directory's i-node (i.e., the mapping state), invalidating client caches to provide strong consistency guarantees. Lustre's proposed clustered metadata [1] service splits a directory, using a hash of the directory entries, only once over all available metadata servers when it exceeds a threshold size. The effectiveness of this "split once and for all" scheme depends on the eventual directory size and does not respond to dynamic increases in the number of servers. Ceph is another object-based cluster filesystem that uses dynamic subtree partitioning of the namespace and hashes individual directories when they get too big or experience too many accesses.
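To make the bucket-split mechanics concrete, here is a compact sketch of extendible hashing (illustrative Python, not GPFS code): on overflow, only the offending bucket splits, and the pointer table doubles only when that bucket's local depth already equals the global depth.

    class ExtendibleHash:
        """Two-level structure: a pointer table over fixed-capacity buckets."""

        def __init__(self, bucket_capacity=4):
            self.cap = bucket_capacity
            self.global_depth = 0              # table has 2**global_depth slots
            self.table = [{"depth": 0, "items": {}}]

        def _slot(self, key):
            return hash(key) & ((1 << self.global_depth) - 1)

        def insert(self, key, value):
            bucket = self.table[self._slot(key)]
            if key in bucket["items"] or len(bucket["items"]) < self.cap:
                bucket["items"][key] = value
                return
            if bucket["depth"] == self.global_depth:
                self.table = self.table + self.table   # double the pointer table
                self.global_depth += 1
            # Split only the overflowing bucket; the new sibling takes the
            # entries whose next hash bit is 1, and half the pointers move over.
            bucket["depth"] += 1
            sibling = {"depth": bucket["depth"], "items": {}}
            bit = 1 << (bucket["depth"] - 1)
            for i in range(len(self.table)):
                if self.table[i] is bucket and i & bit:
                    self.table[i] = sibling
            for k in list(bucket["items"]):
                if self._slot(k) & bit:
                    sibling["items"][k] = bucket["items"].pop(k)
            self.insert(key, value)                    # retry after the split

In GPFS terms, each bucket is a disk block and the pointer table lives in the directory i-node's block pointers, which is why the synchronous i-node updates mentioned above become the consistency choke point.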
There has been some work in the area of designing distributed indexing schemes for metadata management. GIGA+ [16] examines the problem of scalable filesystem directories, motivated by data-intensive applications that require millions to billions of small files to be ingested into a single directory at rates of hundreds of thousands of file creates every second. GIGA+ builds directories with millions to trillions of files with a high degree of concurrency. Compared to GPFS, GIGA+ allows the mapping state to be stale at the client and never be shared between servers, thus seeking even more scalability. Compared to Lustre and Ceph, GIGA+ splits a directory incrementally as a function of size, i.e., a small directory may be distributed over fewer servers than a larger one. Furthermore, GIGA+ facilitates dynamic server addition, achieving balanced server load with minimal migration. This work is interesting but is most relevant in workloads where directories have a huge fan-out factor or where the application creates millions to trillions of files in a single directory. In GIGA+, every server keeps only a local view of the partitions it manages and no shared state is maintained, so there are no synchronization and consistency bottlenecks. But if a server or a partition goes down, or the root-level directory gets corrupted, the files become inaccessible.
Chapter 3: DELEGATING METADATA AT CLIENT SIDE (DMCS)
In this chapter, we focus on the problem posed by managing the entire namespace with a central coordinator. We propose our design, delegating metadata at client side, to handle the problem described in Section 1.4.1. Before we delve into the design for delegating metadata at client side, we first look at Remote Procedure Call (RPC) processing in the Lustre filesystem.
3.1 RPC Processing in Lustre Filesystem
When we consider RPC processing in Lustre, we also discuss how lock processing works in Lustre [5, 7, 3, 18] and how our modifications help minimize the number of LOOKUP RPCs. Let's consider an example. Assume client C1 wants to open the file /tmp/lustre/d1/d2/foo.txt for reading; /tmp/lustre is our mount point. During the VFS path lookup, the Lustre-specific lookup routine will be invoked. The first RPC request is a lock enqueue with lookup intent, sent to the MDS for a lock on d1. The second RPC request is also a lock enqueue with lookup intent and is sent to the MDS asking for an inodebits lock on d2. The lock returned is an inodebits lock, and its resources are represented by the FIDs of d1 and d2. The subtle point to note is that when we request a lock, we generally need a resource
id for the lock we are requesting. However, in this case, since we do not know the resource id for d1, we actually request a lock on its parent /, not on d1 itself. In the intent, we specify a lookup intent with the name d1. Then, when the lock is returned, the lock is for d1. This lock is (or can be) different from what the client requested; the client notices this difference and replaces the requested lock with the one returned. The third RPC request is a lock enqueue with open intent, but it does not ask for a lock on foo.txt. That is, you can open and read a file without a lock from the MDS, since the content is provided by the Object Storage Target (OST). The OSS/OST also has an LDLM component, and in order to perform I/O on the OSS/OST, we request locks from an OST. In other words, what happens at open is that we send a lock request, which means we do ask for a lock from the LDLM server. But in the intent data itself, we may (or may not) set a special flag indicating whether we are actually interested in receiving the lock back, and the intent handler then decides, based on this flag, whether or not to return the lock. If foo.txt existed previously, then its FID, inode content (owner, group, mode, ctime, atime, mtime, nlink, etc.) and striping information are returned. If client C1 opens the file with the O_CREAT flag and the file does not exist, the third RPC request will be sent with open and create intent, but there will still be no lock request. Now on the MDS side, to create the file foo.txt under d2, the MDS will request through LDLM another EX lock on the parent directory. Note that this lock request conflicts with the previous CR lock on d2. Under normal circumstances, a fourth RPC request (a blocking AST) will go to client C1, or anyone else who may hold conflicting locks, informing the client that someone is requesting a conflicting lock and requesting a lock cancellation. The MDS waits until it gets a cancel RPC from the client. Only then does the MDS get
the EX lock it was asking for earlier and can proceed. If client C1 opens the file with the LOV_DELAY flag, the MDS creates the file as usual, but there is no striping and no objects are allocated. The user will issue an ioctl call to set the stripe information, and the MDS will then fill in the EA structure.
3.2 Existing Design
In this section we explain the existing approach followed by Lustre for metadata management.
1. When client 1 tries to open a file, it sends a LOOKUP RPC to the MDS.

2. The processing is done at the MDS side, where the Lock Manager grants the lock for the resource requested by the client. A second RPC is sent from the client to the MDS with the intent to create or open the file.

3. At the end of step 2, client 1 has the lock, extended attribute (EA) information and the other metadata details it needs to open the file successfully.

4. Once the client has the EA information and the lock handle, it can proceed with the I/O operation.

5. The MDS keeps track of the allocation by making use of queues. When multiple clients try to access the same file, a new client waits in the waiting queue until the original client, the current owner of the lock, releases it; the MDS then hands the lock over to the new client. Say client 2 wants to access the same file that was earlier opened by client 1. Client 2 is placed in the waiting queue, and the MDS sends a blocking AST to client 1 to revoke the granted lock. Client 1, on receiving the blocking AST, releases the lock. In a scenario where client 1 is down or something goes wrong, the MDS waits for a ping timeout of 30 seconds, after which it revokes the lock. Once the lock is revoked, the MDS grants a lock handle and the EA for the file to client 2. Client 2 can proceed with I/O once it has the lock handle and EA information.
3.3 Design and challenges for delegating metadata at client side
Before moving ahead with the actual design of our approach, we discuss how Lustre networking works and the communication module that we developed for remote memory copy operations.
3.3.1 Design of communication module
We have designed a communication module for data movement. This communication module bypasses the normal Lustre networking stack protocols and helps perform remote memory data movement operations. We use the LNET API, which originated from Sandia Portals, to design the communication module. The communication module uses the put and get APIs to perform remote memory copies. The remote copy can be used by clients to copy metadata information from the client to whom the metadata has been delegated by the MDS. LNET identifies its peers using an LNET process id, which consists of a nid and a pid. The nid identifies the node, and the pid identifies the process on the node. For example, in the case of the socket Lustre Network Driver (LND) (and for all currently existing LNET LNDs), there is only one instance of LNET running in kernel space; the process id therefore uses a reserved ID (12345) to identify itself. Portal RPC is a client of the LNET layer and takes care of the RPC processing logic. A portal is composed of a list of match entries (MEs). Each ME can be associated with a buffer, which is described by a memory descriptor (MD). The ME itself defines match bits and ignore bits, which are 64-bit identifiers used to decide whether an incoming message can use the associated buffer space. Consider an example to illustrate the point. Say a client wants to read ten blocks of data from the server. It first sends an RPC request to the server indicating that it wants to read ten blocks and that it is prepared for the bulk transfer (meaning the bulk buffer is ready). Then the server initiates the bulk transfer. When the server has completed the transfer, it notifies the client by sending a reply. Looking at this data flow, it is clear that the client needs to prepare two buffers: one associated with the bulk portal for the bulk RPC, and the other associated with the reply portal.
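The ME selection rule can be sketched as a bitwise test (illustrative Python following the Portals-style semantics that LNET inherits; the buffer names are made up): an incoming message's match bits must agree with an ME's match bits on every position not masked by its ignore bits.

    def me_matches(msg_bits, match_bits, ignore_bits):
        """Bits covered by ignore_bits are wildcards; the rest must agree."""
        return (msg_bits ^ match_bits) & ~ignore_bits == 0

    # A portal is a list of MEs, each with an attached buffer (its MD).
    portal = [
        {"match": 0x1234, "ignore": 0x0000, "md": "bulk read buffer"},
        {"match": 0x0000, "ignore": 0xFFFF, "md": "catch-all reply buffer"},
    ]

    def deliver(msg_bits):
        for me in portal:                      # MEs are walked in list order
            if me_matches(msg_bits, me["match"], me["ignore"]):
                return me["md"]                # incoming data lands in this MD
        return None                            # nothing posted: message dropped

    assert deliver(0x1234) == "bulk read buffer"
    assert deliver(0x0042) == "catch-all reply buffer"

In the bulk-read example above, the client would post one buffer on the bulk portal whose match bits name the transfer, and another on the reply portal for the server's response.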
3.3.2 Design of DMCS approach
In this section we explain the design details of the client-side metadata delegation approach.
1. When client 1 tries to open a file, it sends a LOOKUP RPC to the MDS.
2. The processing is done at the MDS side, where the Lock Manager grants the lock for the resource requested by the client. A second RPC is sent from the client to the MDS with the intent to create or open the file. At the end of step 2, C1 has the lock handle, EA information and other metadata details. Conceptually, steps 1 and 2 are similar to the current Lustre design, but in our approach we modify step 2 slightly.
Figure 3.1: Design of DMCS approach
In our approach, in step 2, we make an additional check at the MDS to see whether this is a first-time access to the file. First-time access means this is the first time the metadata information for this file is created on the MDS, and the metadata caches maintained by the kernel do not have the related metadata cached either. If this is a first-time access, we update a data structure that keeps track of who owns the file and validates whether an access is indeed a first-time access. So at the end of step 2, the needed metadata information, such as the lock handle and EA information, goes from the MDS to C1. We compute a hash based on the filename to speed up the lookup process at the MDS side, and we use the communication module for one-sided operations such as remote memory read and remote memory write.
3. In step 3, we expose buffers with information, such as extended attributes, that will be useful to clients who subsequently access the file that was opened by C1. We call the client that exposes the needed buffer information the new owner of the file, and we use the term delegation client for such clients.
4. Now when C2 tries to open the same file, it performs an RPC in step 4, just as in step 1. We call C2 the normal client.
5. In step 5, the normal client's request triggers a lookup in the hash table at the MDS side, the one updated in step 2, which finds that C1 is the owner of the file. So instead of spending additional time at the MDS side, we return the needed information to C2.
6. In step 6, the normal client, C2 in our case, contacts the delegation client, C1 in our case, and fetches the information that was stored in the buffers exposed by the delegation client for this specific file. We use our communication module to speed up this process using one-sided operations.
7. Once C2 gets the needed metadata information from C1, it can proceed with
the I/O operations.
This design can help minimize the request traffic at the MDS by delegating some load to the clients. Distributed subtree partitioning and pure hashing are the methods used to distribute the namespace and workload among metadata servers in existing distributed filesystems. We use a combination of both approaches. By partitioning the namespace across the MDS and the clients, we minimize the load on the MDS; by using a pure hashing scheme, we compute a hash on the filename and decide which bucket, and hence which node, the file's metadata has been delegated to. When a client accesses a file that was created earlier by some client, the file has an entry in the hash table, so we compute the hash based on the filename and divert the client to the delegation client to get the metadata information. If the hash table does not have the needed mapping information, then this is the first-time access for the file. This design allows the authoritative copy of each metadata item to reside on different clients, distributing the workload. In addition, a delegation record is much smaller than the metadata itself, allowing the MDS cache to remain effective as the number of files accessed increases. This architecture works well when the file pool size increases and many clients are simultaneously accessing files on the MDS. With this design, the workload is distributed among the MDS and the clients.
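A minimal sketch of this MDS-side bookkeeping follows (illustrative Python, not Lustre code; the MD5-based bucket choice is an assumption): the filename hash indexes the delegation table, a first access records the new owner, and later accesses are redirected.

    import hashlib

    class DelegationTable:
        """Toy model of the MDS-side delegation records in DMCS."""

        def __init__(self, nbuckets=1024):
            self.buckets = [dict() for _ in range(nbuckets)]

        def _bucket(self, path):
            digest = hashlib.md5(path.encode()).hexdigest()
            return self.buckets[int(digest, 16) % len(self.buckets)]

        def open(self, path, client):
            bucket = self._bucket(path)
            owner = bucket.get(path)
            if owner is None:
                # First-time access: serve the metadata, delegate it to this
                # client, and keep only a small delegation record.
                bucket[path] = client
                return ("serve-and-delegate", client)
            # Later accesses: redirect to the delegation client, which exposes
            # the EA/lock buffers for one-sided remote memory reads.
            return ("redirect", owner)

    mds = DelegationTable()
    assert mds.open("/d1/d2/foo.txt", "C1") == ("serve-and-delegate", "C1")
    assert mds.open("/d1/d2/foo.txt", "C2") == ("redirect", "C1")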
Secondly, with metadata delegated at the client side, instead of caching the complete metadata information, the MDS only stores a record of the delegation client for each file, which greatly reduces its cache memory usage. The metadata is distributed across all clients, so no single client becomes a bottleneck when many clients are accessing many files.
Finally, when many clients try to access a large number of files, the MDS is not able to serve these requests from its memory due to the sheer amount of metadata to be cached; Table 1.3 shows the effect. The MDS becomes busy reading requested metadata that is widely dispersed on disk in order to load the blocks into memory. Meanwhile, metadata already in memory has to be evicted to make space for newly loaded metadata. Later requests for the evicted metadata have to be serviced from disk, which aggravates the burden on the MDS. Obviously the MDS becomes a bottleneck. Client-side metadata delegation distributes the responsibility for many files across many clients, so that a request for any given file hits the MDS disk at most once, and all following requests for that file can be serviced by a delegated client. No single node becomes a bottleneck in the entire system. Although an additional network round trip is incurred as a result of the redirection, this overhead is tiny compared to the disk access time at the MDS. With high-bandwidth, low-latency interconnect technologies such as InfiniBand, this network round trip time is likely to be negligible.
We also studied the impact of the underlying storage and transport on metadata performance. The experiment was run on 8 client nodes, each running Lustre 1.8.1. Our results show that even after replacing the underlying storage medium with a faster device such as an SSD, we do not see a large improvement in metadata operation rates. Table 3.1 shows the details.
Table 3.1: Metadata operation rates with different underlying storage

Metadata operation   HDD/TCP   SSD/TCP   HDD/IB   SSD/IB
create()                 455       457      893      935
open()                   602       602    1,441    1,443
stat()                 1,472     1,481    3,131    3,065
chmod()                  501       504    1,171    1,219
unlink()                 405       421      843      883
mkdir()                  545       519    1,221    1,229
rmdir()                  265       267      609      621

3.3.3 Challenges
While designing the client-side metadata delegation approach, we need to take care of several challenges. In this section we state these challenges and describe the approach taken to solve them.
3.3.4 Metadata revocation
Delegating metadata at the client side distributes the workload of the MDS across client nodes. This is very beneficial when many files are being accessed by many clients (i.e., a many-to-many file access pattern). However, when all clients access a single file (i.e., an N-to-1 file access pattern), the hot-spot is simply moved from a very powerful MDS to a relatively less powerful client. Therefore, provisions must be made to avoid delegating to a node when this situation is likely to arise, and to be able to pull back a delegation if the situation arises unexpectedly (including updating the information on the clients so they know who holds the authoritative metadata). We have implemented the metadata revocation logic in the communication module, which takes care of this challenge by revoking the metadata when the delegation client becomes a hot-spot.
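A minimal sketch of the hot-spot test that could drive such a revocation decision is shown below; the counters and the threshold are illustrative assumptions, not the exact heuristics of our communication module:

#include <stdbool.h>
#include <stdint.h>

/* Assumed per-file statistics observed at the delegation client. */
struct file_stats {
    uint32_t concurrent_fetchers;  /* clients currently fetching metadata */
    uint32_t requests_per_sec;     /* recent request rate for this file */
};

#define HOTSPOT_THRESHOLD 64       /* illustrative tuning knob */

/* Revoke the delegation when an N-to-1 pattern makes the client a hot-spot. */
bool should_revoke(const struct file_stats *s)
{
    return s->concurrent_fetchers > HOTSPOT_THRESHOLD ||
           s->requests_per_sec > HOTSPOT_THRESHOLD;
}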
3.3.5 Distributed lock management for the MDCS approach
To take care of the consistency and reliability aspects, we have designed a lock management scheme. The distributed locking scheme ensures consistency when many concurrent clients access the same file. One of the primary responsibilities of the lock management scheme is to protect the shared data structure, i.e., the hash table maintained at the MDS, since this hash table records how the metadata is delegated and who currently owns it. Consider a scenario where client C1 is the delegation client holding the metadata for a specific file, while clients C2-C10 are also accessing the same file and have learned from the MDS that the file's metadata has been delegated to C1. If a metadata revocation request arrives while clients C2-C10 are performing the data movement, i.e., fetching the file metadata, then this revocation request is queued until all the data movement operations are complete. In the existing design of Lustre, or of any parallel filesystem, the client cache is flushed whenever a file is closed. If a client is performing an operation on a file and meanwhile another client wants a lock on the same file, then, depending on the lock compatibility matrix, the original client may have to flush its cache to the storage node. The lock compatibility matrix specifies which operations may proceed at the same time; for example, if both clients want to grab a read lock, both can proceed (see the sketch below). For conflicting compatibility matrix entries, the original client flushes its cache and hands the lock over to the MDS; the MDS then grants the lock to the new client, which can proceed. The overhead involved in this step is high, since the process spends a lot of time in communication and the cache also needs to be flushed.
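The compatibility check itself can be pictured as a small table lookup. The sketch below reduces the matrix to two modes, read (PR) and write (PW), purely for illustration; Lustre's distributed lock manager actually defines several more modes, but the principle is the same:

#include <stdbool.h>

enum lock_mode { LCK_PR = 0, LCK_PW = 1 };  /* protected read / write */

/* Two read locks are compatible; any combination involving a write
 * lock conflicts, forcing the holder to flush its cache and hand the
 * lock back to the MDS. */
static const bool compatible[2][2] = {
    /*           PR     PW  */
    /* PR */ { true,  false },
    /* PW */ { false, false },
};

bool lock_request_compatible(enum lock_mode held, enum lock_mode wanted)
{
    return compatible[held][wanted];
}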
In Lustre, important file attributes, such as the file size, modification time and access time, are stored at the OSS. So when one client flushes its cache and a new client subsequently accesses the file data from the OSS, the Lock Manager at the OSS ensures that consistency is maintained.
3.4 Performance Evaluation
We have implemented our design in Lustre 1.8.1.1 to minimize the number of RPC calls during a metadata operation, and we conducted experiments to evaluate metadata operation performance with the proposed design. One node acts as the Lustre Metadata Server (MDS), and two nodes are Lustre Object Storage Servers (OSS). The Lustre filesystem is mounted on eight other nodes, which act as Lustre client nodes. Each node runs kernel 2.6.18-128.7.1.el5 with Lustre 1.8.1.1 and has dual Intel Xeon E5335 CPUs (8 cores in total) and 4 GB of memory. The nodes are interconnected with 1 GigE for general-purpose networking, and we configured Lustre to use the TCP transport in these runs. To measure the performance of metadata operations such as open(), we developed a parallel micro-benchmark: we extended the basic fileop testing tool that comes with the IOzone benchmark [2] to support parallel runs with multiple processes on many Lustre client nodes. The extended fileop tool creates a file tree structure for each process.
This tree structure contains X Level-1 directories, with each Level-1 directory containing Y Level-2 directories; the total depth of sub-directories can be configured at run time. Within each bottom-level directory, Z files are created. By varying the size (fan-out) of each layer, we can generate different numbers of files in a file tree. We developed an MPI parallel program to start multiple processes on multiple nodes. Each process works in a separate directory to create its aforementioned file tree. After that, each process walks through its neighbor's file tree to open each file in that sub-tree, simulating a scenario in which multiple client processes take turns accessing a shared pool of files. The wall-clock times of all the processes are then summed, and the total IOPS for the open system call is reported. To perform the tests, we created a number of files from a specific client, and those files were subsequently accessed by other clients in an interleaved manner. We could not simulate this scenario with the Postmark benchmark [5], since Postmark creates some N files and deletes the file pool as soon as the open/create or read/append operations complete. We therefore used the above micro-benchmark to run the tests and obtain the experimental results. To see the benefit of the proposed approach in minimizing RPCs, we carried out three different types of tests using our micro-benchmark: 1) open IOPS for different numbers of client processes, 2) open IOPS for different file pool sizes, and 3) time spent in open for varying path-name depth. A simplified sketch of the benchmark driver follows.
3.4.1 File Open IOPS: Varying Number of Client Processes
In this test, we first create the aforementioned file tree, containing 10,000 files for every client process, then let each process access its neighbor's file tree. Figure 3.2 shows the aggregated IOPS for the open system call on the Lustre filesystem. We vary the number of client processes from 2 to 16, evenly distributed over the 8 client nodes: with 2 processes, only two client nodes are actually used; with 16 processes, 2 client processes run on each of the 8 client nodes. As seen in Figure 3.2, the modified Lustre with MDCS improves the aggregated IOPS significantly over basic Lustre. Compared to basic Lustre, our design reduces the number of RPC calls on the metadata operation path, and therefore helps improve overall performance. With two client processes, the new approach (MDCS) raises file open IOPS from 2,528 per second to 3,612 per second. With basic Lustre, on the other hand, the metadata server appears able to handle 8 concurrent client processes, given the slightly higher file open IOPS at that point. When 16 processes are used, however, MDS performance drops due to high contention, similar to what we see with the MDCS approach.
3.4.2 File Open IOPS: Varying File Pool Size
In this test we carry out the same basic steps as in Section 3.4.1, but we vary the number of files in each per-process file tree, while using the same 16 client processes.
Figure 3.2: File open IOPS, Each Process Accesses 10,000 Files (Basic Lustre vs. MDCS-modified Lustre; y-axis: number of open() per second, x-axis: number of client processes)
We wanted to understand the significance of this factor for performance. Figure 3.3 shows the experimental results, which clearly demonstrate the benefits of our MDCS design. We observe that varying the file pool size for a constant number of processes does not produce a large deviation in open IOPS. We speculate that this is because the file pool sizes used in our test are not big enough to stress the memory of the MDS, so most of the files' metadata stays in the MDS's memory cache. As a result, the aggregated metadata operation throughput remains constant across different file pool sizes. In a future study we will experiment with larger file pools to push the memory limit of the MDS.
Figure 3.3: File open IOPS, Using 16 Client Processes (Basic Lustre vs. MDCS-modified Lustre; x-axis: number of files in the file pool)
3.4.3 File Open IOPS: Varying File Path Depth
In this test we measure the performance benefit of the new MDCS approach when accessing files with different file path depths, i.e., different numbers of components in the file path. We start by creating a file tree for each client process containing 10,000 files, with a file path depth of 3 or 4. Each process then begins to access files within its neighbor process's file tree. Figure 3.4 compares the time spent opening one file with basic Lustre and with the MDCS-modified Lustre filesystem. First of all, it shows that MDCS can reduce the time to open one file by up to 33%. We also observe that the number of path-name components has a significant impact on the total cost of a metadata operation: each file path component has to be resolved with one RPC to the MDS, hence a deeper file path leads to a longer processing time.
Figure 3.4: Time to Finish open(), Using 16 Processes Each Accessing 10,000 Files (Basic Lustre vs. MDCS-modified Lustre; y-axis: time to finish one open() in milliseconds, x-axis: number of components in the file path)
3.5 Summary
We have described a mechanism for minimizing the load on a single metadata server for the Lustre filesystem. A single metadata server managing the entire filesystem namespace is common to most parallel filesystem approaches to metadata management. In this design we minimize the load on the MDS, and hence the memory pressure on the MDS, by delegating metadata to the client side. We evaluated our design and compared it with the basic variant of Lustre. For a metadata operation like file open(), the throughput increases as the number of client processes increases, whereas with the basic variant of Lustre the throughput decreases. We see similar behavior when the number of files in the file pool is increased. One of the primary reasons for the slowdown of the basic variant of Lustre is that, as the file pool size grows, the amount of file metadata to be kept in the MDS cache increases.
Chapter 4: DESIGN OF A DECENTRALIZED METADATA SERVICE LAYER FOR DISTRIBUTED METADATA MANAGEMENT
4.1 Detailed design of Distributed Union FileSystem (DUFS)
The core principle of the Distributed Union FileSystem (DUFS) is to distribute the load of metadata operations across multiple distributed filesystems. DUFS provides a single POSIX-compliant filesystem abstraction to the user, without revealing the multiple underlying filesystem mounts. With such an abstraction, the single metadata server of the back-end distributed filesystem is no longer a bottleneck. However, as described in Section 1.4.2, consistency has to be guaranteed across multiple clients performing simultaneous metadata operations; this task is delegated to the distributed coordination service, ZooKeeper [14].
DUFS maps each virtual filename, as seen by the user, to a physical path corresponding to one of the underlying filesystem mounts. A single level of indirection is introduced through a File Identifier (FID), which uniquely identifies each file. Figure 4.1 shows a schematic view of this indirection in our design. The virtual-path-to-FID mapping information is kept by ZooKeeper in a consistent manner, while the mapping between the FID and the physical path is carried out with a universally known deterministic mapping function of which every DUFS client is aware. This second mapping step does not require any coordination between clients. Consistency management at the physical storage level is offloaded to the underlying filesystem.

Figure 4.1: DUFS mapping from the virtual path to the physical path using the File Identifier (FID)
This single level of indirection offers flexibility, allowing the contents of a file to be represented independently of its name. Indeed, a filename can represent two different data contents (after a deletion and a new creation with the same name); conversely, the data contents can correspond to any filename (for instance, after a rename operation). This representation also makes rename operations and physical data relocation easier. Finally, directories and directory trees are considered metadata only, so they are not physically created on the back-end storage; instead, the directory-tree information is maintained in memory by ZooKeeper.
4.1.1 Implementation Overview
The design of DUFS is broken down into three main components: the filesystem interface based on FUSE, the metadata management based on ZooKeeper, and the back-end storage provided by the underlying parallel filesystem. A DUFS client instance is purely local software that does not interact directly with other DUFS clients; any necessary interaction is made through the ZooKeeper service or over the back-end storage.
Figure 4.2: DUFS overview: applications on client nodes access DUFS through the FUSE interface, and DUFS reaches the ZooKeeper servers via the ZooKeeper client library and the back-end distributed filesystem storage via the back-end storage clients. A, B, C and D show the steps required to perform an open() operation.
Figure 4.2 shows the basic steps required to perform an open() operation on a file using DUFS.

A. The open() call is intercepted by FUSE, which passes the virtual path of the file to DUFS.

B. DUFS queries ZooKeeper to get the Znode based on the filename and retrieve the FID. If the file does not exist, ZooKeeper returns an error.

C. DUFS uses the deterministic mapping function to find the physical path associated with the FID.

D. Finally, DUFS opens the file based on its physical path. The result is returned to the application via FUSE.

In contrast, directory operations take place only at the metadata level, so only ZooKeeper is involved and not the back-end storage; thus, only steps A and B are performed.

The following subsections describe the primary components of DUFS.
4.1.2 FUSE-based Filesystem Interface
We use FUSE to provide a POSIX-compliant filesystem interface to applications; thus, our DUFS prototype appears as a classic mount-point of a standard filesystem. Most of the basic filesystem operations, such as mkdir, create, open, symlink, rename, stat, readdir, rmdir, unlink, truncate, chmod, access, read, and write, are implemented in DUFS. When an application performs a filesystem operation, it operates on the virtual path exposed by DUFS. The filesystem operations are translated into FUSE-specific operations; for example, the open() call from the application is translated into dufs_open() in DUFS. Finally, for each filesystem operation, DUFS returns the correct result after querying the ZooKeeper-based metadata management service and the back-end storage as needed.
4.2 ZooKeeper-based Metadata Management
We use the ZooKeeper distributed coordination service to handle the consistency threats posed by simultaneous distributed accesses from several DUFS clients. The synchronous ZooKeeper API is used for this purpose.
With our design, ZooKeeper stores part of the virtual filesystem metadata: it keeps track of the directories and files that are created. A separate Znode is created in ZooKeeper for each directory or file, and the virtual filesystem hierarchy is represented inside ZooKeeper using Znodes.
ZooKeeper associates several information fields with each Znode. Standard fields include the Znode creation time, the list of child Znodes, and so on. ZooKeeper also provides a custom data field for each Znode; in DUFS, this custom field records whether the Znode represents a directory or a file, and in the latter case the FID of the file is also stored in this field.
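As an illustration, the following sketch uses the synchronous ZooKeeper C API to create a Znode whose data field encodes the type and the FID; the exact encoding that DUFS uses internally is an assumption here:

#include <zookeeper/zookeeper.h>
#include <stdio.h>

/* Create a Znode for a new file, storing "F" (file) plus the 128-bit FID
 * in the custom data field; a directory would store "D" with no FID. */
void create_file_znode(zhandle_t *zh, const char *virtual_path,
                       unsigned long long fid_hi, unsigned long long fid_lo)
{
    char data[40];
    int len = snprintf(data, sizeof(data), "F:%016llx%016llx", fid_hi, fid_lo);

    int rc = zoo_create(zh, virtual_path, data, len,
                        &ZOO_OPEN_ACL_UNSAFE, 0, NULL, 0);
    if (rc != ZOK)
        fprintf(stderr, "zoo_create(%s) failed: %s\n", virtual_path, zerror(rc));
}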
The ZooKeeper architecture uses multiple ZooKeeper servers, and the data is replicated among all of them. ZooKeeper uses coordination algorithms to ensure that the Znode hierarchy and its contents are consistent across the servers and that all modifications are applied in the same order on every server [14].
All this information is kept in memory, and ZooKeeper servers can be located close to the DUFS clients. Thanks to this, ZooKeeper queries are fast and a large operation throughput can be achieved; this raw throughput is studied in Section 4.4.1. The counterpart, however, is that the ZooKeeper servers use a large amount of memory; we study this memory usage in Section 4.4.1 as well.
4.2.1 File Identifier
In our design, we use a File Identifier (FID) to uniquely represent the physical contents of a file. This FID is stored in the custom data field of the Znode corresponding to the virtual path of the file. The FID is designed to be unique for each newly created file; however, modifications to the contents of a file do not require changing the FID.

In DUFS, the FID is a 128-bit integer. We propose a simple approach to generate a unique FID at the DUFS client without requiring any coordination. The FID for a file is generated by the client that initially creates the file: it is the concatenation of a 64-bit client ID, which uniquely identifies the DUFS client instance that created the file, and a 64-bit file creation counter, which records the number of files created throughout the lifetime of that DUFS client. When a client is restarted, it acquires another unique 64-bit client ID and its creation counter is reset to 0.
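A sketch of this coordination-free FID generation (the names are illustrative):

#include <stdint.h>

/* 128-bit FID: a 64-bit client ID concatenated with a 64-bit counter. */
struct fid {
    uint64_t client_id;  /* unique per DUFS client instance */
    uint64_t counter;    /* files created by this client so far */
};

static uint64_t creation_counter;  /* reset to 0 on client restart */

struct fid new_fid(uint64_t client_id)
{
    /* No coordination needed: the (client_id, counter) pair is unique
     * as long as each restarted client acquires a fresh client ID. */
    struct fid f = { client_id, creation_counter++ };
    return f;
}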
The FID is used by DUFS to deduce the physical location of the file and the physical filename. Firstly, the physical location of the data in the underlying filesystem is determined using the deterministic mapping function. Secondly, the filename for the data contents on the physical storage is generated from the FID. In this manner, the contents of a file never have to be renamed or moved between different physical mounts when the virtual filename is renamed or moved.
4.2.2 Deterministic mapping function
The deterministic mapping function associates a physical location with each file's contents based on its FID. It takes as input a 128-bit integer representing the FID and returns a number between 1 and N, with N being the number of back-end underlying storage systems. It has to be deterministic so that any DUFS client can find the right location without coordination.

To achieve good load balancing across the different underlying storage mounts, the mapping function has to distribute the FIDs in a fair manner. For this reason, the mapping function in our current implementation is based on the MD5 hash function, which has this property [19]. Our mapping function is:
    fid ↦ MD5(fid) mod N
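A possible implementation of this function using OpenSSL's MD5 is sketched below; how the 128-bit digest is folded to an integer before the modulo is not specified in the text, so that step is an assumption:

#include <openssl/md5.h>
#include <stdint.h>

/* Map a 128-bit FID to a back-end mount number in 1..N. */
int map_fid_to_backend(const unsigned char fid[16], int n)
{
    unsigned char digest[MD5_DIGEST_LENGTH];
    MD5(fid, 16, digest);

    /* Fold the first 8 digest bytes into an integer, then reduce mod N. */
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v = (v << 8) | digest[i];
    return (int)(v % (uint64_t)n) + 1;
}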
4.2.3 Back-end storage
Once a particular physical filesystem has been chosen using the deterministic mapping function, the data is accessed directly through the local mount-point of that distributed filesystem. The filename is deterministically derived from the FID; thus it is independent of any virtual filename, and the DUFS client does not need to communicate with any other component to find the actual physical filename.

In DUFS, the physical filename used to store a file is the hexadecimal representation of the FID computed in the previous step. To avoid congestion due to file creation in a single directory, the hexadecimal representation is divided into four parts to create multiple path components. The
first component has the filename, while the other components are used for the path
hierarchy. Figure 4.3 shows an example of the filename on the back-end storage for
the FID 0123456789abcdef.
FID: 0123456789abcdef
Physical filename: cdef/89ab/4567/0123

Figure 4.3: Sample physical filename generated from a given FID
This directory hierarchy is static and identical across all the back-end mount-points; this static structure avoids any potential conflict.
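The path construction can be sketched as follows (shown for a 64-bit hexadecimal representation to keep the example short):

#include <stdio.h>
#include <stdint.h>

/* Split the hexadecimal FID into four 4-digit components, e.g.
 * 0123456789abcdef -> "cdef/89ab/4567/0123" as in Figure 4.3. */
void physical_path(uint64_t fid, char *out, size_t outlen)
{
    snprintf(out, outlen, "%04llx/%04llx/%04llx/%04llx",
             (unsigned long long)(fid & 0xffff),
             (unsigned long long)((fid >> 16) & 0xffff),
             (unsigned long long)((fid >> 32) & 0xffff),
             (unsigned long long)((fid >> 48) & 0xffff));
}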
4.3 Algorithm examples for Metadata operations
In this section, we give algorithms for some metadata operations in DUFS. Figure 4.4 shows the algorithm for the mkdir() operation; Figure 4.5 shows the algorithm for the stat() operation.
Get the virtual path of the directory
Look for the corresponding Znode
if Znode exists then
    return 'File exists' error code
else
    Generate the data field with type and metadata information
    Create the corresponding Znode with ZooKeeper
    if success then
        return Success
    else
        Handle error
    end if
end if

Figure 4.4: Algorithm for the mkdir() operation
Get the virtual path of the file/directory
Get the corresponding Znode with ZooKeeper
if Znode does not exist then
    return 'No such file or directory' error code
else
    ZooKeeper returned the data field (type, FID, ...)
    if Znode type is directory then
        Fill the struct stat with information stored in ZooKeeper
        return struct stat
    else
        Compute the physical location
        Compute the physical path
        Perform stat() on the physical file
        return struct stat
    end if
end if

Figure 4.5: Algorithm for the stat() operation
4.3.1 Reliability concerns

The DUFS client does not hold any state: all the required information is stored either in ZooKeeper or in the back-end storage. DUFS reliability therefore relies on ZooKeeper and the back-end distributed filesystems.

For ZooKeeper, all the information is duplicated among all the servers. Thanks to this, ZooKeeper is able to tolerate the failure of many servers; it needs a majority of the servers alive to maintain the consistency of the data [14]. Further, although each ZooKeeper server keeps all its data in memory, the data is periodically checkpointed to disk, so ZooKeeper can even tolerate the failure of all servers by restarting them later.
Many distributed filesystems, such as Lustre, provide fault tolerance: data can be replicated among multiple data servers. If such filesystems are used as back-end storage, DUFS availability benefits from it.
4.4 Performance Evaluation
In this section, we conduct experiments to evaluate the performance of metadata operations with our proposed design. These tests were performed on a Linux cluster. Each node has dual Intel Xeon E5335 CPUs (8 cores in total) and 6 GB of memory, uses a 250 GB SATA hard drive as its storage device, and runs kernel 2.6.30.10. The nodes are connected with 1 GigE for general-purpose networking. We dedicate a set of nodes as Lustre MDS and OSS (version 1.8.3) to form multiple instances of the Lustre filesystem, and another set of dedicated nodes works as PVFS servers (version 2.8.2) to export multiple instances of the PVFS filesystem. Each client node mounts multiple instances of the Lustre and PVFS filesystems and uses DUFS to merge these distinct physical partitions into a logically uniform partition. A ZooKeeper server runs alongside the DUFS clients, and together they provide the distributed coordination service over 1 GigE. We used the mdtest benchmark [13] for our experiments, creating a directory structure with a fan-out factor of 10 and a directory depth of 5; as the number of processes increases, the number of files per directory increases accordingly. We also carried out experiments where many files are created in a single directory. We used the same parameters and configuration when experimenting with the different back-end parallel filesystems, Lustre and PVFS.
4.4.1 Distributed coordination service throughput and memory usage experiments
With the DUFS design, each metadata operation has to go through the ZooKeeper service before it is actually issued to the corresponding physical back-end filesystem. In this section we performed experiments to study ZooKeeper's throughput for basic operations, namely zoo_create(), zoo_get(), zoo_set() and zoo_delete(), using ZooKeeper's synchronous API. With a total of 8 DUFS clients in the experimental setup, we varied the number of ZooKeeper servers from 1 to 8. The results are shown in Figure 4.6. For the zoo_create(), zoo_delete() and zoo_set() operations, we can see that the overall throughput drops as the number of ZooKeeper servers grows. This is the expected behavior, since these operations perform modifications on the Znodes, so all the ZooKeeper servers have to coordinate to ensure the consistency of their replicated state. For the zoo_get() operation, the overall throughput increases with the number of ZooKeeper servers: ZooKeeper performs very well on read-dominant workloads [8], since each ZooKeeper server can serve read requests independently of the others.
Since ZooKeeper keeps all its data in memory, memory usage can be a concern. In the following experiment, we study the memory usage of ZooKeeper (a Java process), and of DUFS as well, as the amount of metadata increases. We designed a benchmark that creates a large number of directories and reports the resident memory size of each process. For this experiment, all the processes ran on the same node.
Figure 4.6: ZooKeeper throughput for basic operations, varying the number of ZooKeeper servers (1, 4 and 8) and the number of client processes: (a) zoo_create(), (b) zoo_delete(), (c) zoo_set(), (d) zoo_get()
Additionally, in order to compare the memory usage of DUFS, we ran the same benchmark on a dummy FUSE filesystem that does nothing except forward requests to a local filesystem.
Figure 4.7: ZooKeeper memory usage compared with DUFS and a basic FUSE-based filesystem (y-axis: memory usage in MB; x-axis: millions of directories created)
The results are shown in Figure 4.7. We can see that the memory consumed by DUFS is bounded and similar to that of a normal FUSE-based filesystem, which is what we expect. The ZooKeeper memory usage is proportional to the number of created directories or files (the Znode data size is similar for a file and a directory). From these numbers, we can estimate that storing one million files or directories requires about 417 MB of memory. This drawback comes from ZooKeeper's design choice of keeping all data in memory.
4.4.2 Scalability Experiments
For the scalability experiments we run a ZooKeeper server on each of the DUFS clients. We evaluate scalability by varying the number of client processes from 4 to 256 and the number of physical nodes over 4, 8 and 16. In these experiments, since the ZooKeeper servers are local to the DUFS clients, read requests achieve high throughput, whereas updates require a higher level of synchronization among the servers of the ensemble.

By varying the number of physical nodes and the number of client processes running on them, we can see that the approach suggested in this chapter performs better than the basic variants of Lustre and PVFS as the number of client processes increases. As expected, directory creation, directory removal and directory stat perform better; directory stat, being a read operation, performs exceedingly well compared to the basic variant of Lustre. For file operations such as file creation, file removal and file stat we see a similar trend: although we cannot reach throughput as high as for the directory operations, we still perform better than the basic variants of the parallel filesystems. For a file operation we have to contact the actual back-end filesystem to get the file attributes, whereas for a directory operation most requests are satisfied at the ZooKeeper level itself.
4.4.3 Experiments with varying number of distributed coordination service servers
In this section we performed experiments to study the outcome of varying the number of ZooKeeper servers. We used a set of 8 nodes with 8 DUFS clients, which use a number of ZooKeeper servers varying from 1 to 8.
Figure 4.8: Scalability experiments with 8 client nodes and varying numbers of client processes, comparing Basic Lustre and DUFS: (a) directory creation, (b) directory removal, (c) directory stat, (d) file creation, (e) file removal, (f) file stat
Figure 4.9: Scalability experiments with 16 client nodes and varying numbers of client processes, comparing Basic Lustre and DUFS: (a) directory creation, (b) directory removal, (c) directory stat, (d) file creation, (e) file removal, (f) file stat
We measured the operation throughput and compared it against the basic Lustre throughput.
The results are presented in Figure 4.10. As expected, read operations such as file stat() and directory stat() show a significant performance improvement when the number of ZooKeeper servers is increased. For the other operations, the effect of the number of ZooKeeper servers is smaller. Overall, these results show that using 8 ZooKeeper servers is a good compromise for our configuration.
4.4.4 Experiments with different numbers of mounts combined using DUFS
In this section we performed experiments to study the influence of varying the number of back-end storage mounts combined by DUFS. For this experiment we had an ensemble of 8 ZooKeeper servers. Since directory operations do not touch the back-end distributed filesystem, we focus only on file operations in this experiment.
Figure 4.11 shows the throughput of file operations for 2 and 4 back-end storage mounts and for different numbers of client processes; we also compare this throughput to the basic Lustre case. Using 4 back-end storage mounts instead of 2 provides a small improvement for file creation and removal. For file stat(), we see an improvement of more than 37% with 256 client processes.

Although the file operations are uniformly distributed among the back-end storage mounts, there is an indirection through a ZooKeeper server. File removal and creation require a metadata modification, and the cost of this modification overtakes the benefit of multiple back-end storage mounts. The file stat() operation only requires reading the metadata, which is very fast with ZooKeeper; that is why we see a clear benefit from increasing the number of back-end storage mounts in this case.
Figure 4.10: Operation throughput with varying numbers of ZooKeeper servers (Basic Lustre vs. 1, 4 and 8 ZooKeeper servers) for 64, 128 and 256 client processes: (a) directory creation, (b) directory removal, (c) directory stat, (d) file creation, (e) file removal, (f) file stat
Figure 4.11: File operation throughput for different numbers of back-end storage mounts (Basic Lustre vs. DUFS with 2 and 4 mounts) for 64, 128 and 256 client processes: (a) directory creation, (b) directory removal, (c) directory stat, (d) file creation, (e) file removal, (f) file stat
In any parallel filesystem, if a directory is spread across multiple partitions on a server, the 'ls -l' operation can be costly; it is even costlier in a larger environment where directories are spread across different partitions on different servers. With the approach presented in this thesis, we obtain a significant improvement for the 'ls -l' operation even though the files may be evenly distributed across different partitions on different servers.
4.4.5 Experiments with different back-end parallel filesystems
In this section, we study the performance of our DUFS prototype in comparison with two distributed filesystems, Lustre and PVFS2. To keep the comparison fair, we also use Lustre and PVFS2 as our back-end storage, and we study scalability by increasing the number of client processes. In these experiments we had 8 DUFS clients and 8 ZooKeeper servers; the ZooKeeper servers and DUFS clients ran on the same nodes.
From Figure 4.12, we can see that DUFS with Lustre as the back-end physical filesystem outperforms basic Lustre; we see similar results in the PVFS case. One notable point is that for the directory operations we see a similar trend whatever the back-end physical mount. This is expected because, in DUFS, directory operations rely only on ZooKeeper. Also, for file operations like creation, stat and removal, DUFS with Lustre as the back-end filesystem performs far better than DUFS with PVFS2 as the back-end filesystem. This is because in that case the back-end storage is actually used, and thus the throughput of these operations depends on the performance of the back-end filesystem.
Figure 4.12: Operation throughput with respect to the number of client processes for Lustre and PVFS2 (Basic Lustre, DUFS merging 2 physical Lustre mounts, Basic PVFS, DUFS merging 2 physical PVFS mounts): (a) directory creation, (b) directory removal, (c) directory stat, (d) file creation, (e) file removal, (f) file stat
From the scalability point of view, we see that Lustre and PVFS2 do not scale very well: when the number of client processes grows significantly, their performance drops. Conversely, DUFS does not perform as well at small scale, but it outperforms Lustre for all operations with 256 client processes. In all cases, DUFS with PVFS2 back-end storage is clearly better than PVFS2 alone.
For directory creation with 256 client processes, DUFS outperforms Lustre by a
factor of 1.9, and PVFS2 by a factor of 23.
Finally, we can see that for directory and file stat the approach discussed in this thesis performs exceedingly well compared to the basic variants, Lustre and PVFS2. With respect to file stat() with 256 processes, our approach is 1.3 and 3.0 times faster than Lustre and PVFS, respectively. This is mainly because ZooKeeper performs well on read-dominant workloads.
4.5 Summary
We have designed a Distributed Metadata Service Layer and evaluated its benefits to parallel filesystems. Distributed metadata management is a hard problem, since it involves taking care of various consistency and reliability aspects. Moreover, scaling metadata performance is more complex than scaling raw I/O performance, and with distributed metadata this complexity increases further. This leads to the primary goal in designing a Distributed Metadata Service Layer: to improve scalability while taking care of consistency and reliability. With our approach, we are able to maintain good performance even with a large number of clients. With 256 client processes, we are able to outperform Lustre for the six metadata operations measured, namely directory creation, directory removal, directory stat, file creation, file removal and file stat.
Chapter 5: CONTRIBUTIONS AND FUTURE WORK
In this thesis, we have designed approaches for managing metadata in parallel filesystems. Our work involved the design of a scheme that delegates metadata to the client side so as to minimize the load on a single metadata server (MDS). We also designed a new approach for distributed metadata management and addressed the various challenges it raises.
5.1 Summary of Research Contributions and Future Work
The research in this thesis aims at solving two important problems faced in parallel filesystem environments:

1. A single metadata server is a bottleneck. We have designed a metadata management scheme that minimizes the load on a single MDS.

2. Recent trends in high-performance computing have seen a shift toward distributed resource management. With distributed metadata we need to take care of some complex issues related to reliability and consistency: there is always a compromise between maintaining reliability and consistency in the filesystem and achieving a scalable solution. Our approach tries to solve the problem of distributed metadata management with the primary aim of maintaining the reliability and consistency of the filesystem while at the same time improving its scalability.
5.1.1 Delegating metadata at client side
We have described a mechanism for minimizing the load on a single metadata server for the Lustre filesystem. A single metadata server managing the entire filesystem namespace is common to most parallel filesystem approaches to metadata management. In this design we minimize the load on the MDS, and hence the memory pressure on the MDS, by delegating metadata to the client side. We evaluated our design and compared it with the basic variant of Lustre. For a metadata operation like file open(), the throughput increases as the number of client processes increases, whereas with the basic variant of Lustre the throughput decreases; we see similar behavior when the number of files in the file pool is increased. One of the primary reasons for the slowdown of the basic variant of Lustre is that, as the file pool size grows, the amount of file metadata to be kept in the MDS cache increases. The MDS metadata cache is not flushed until it reaches a threshold that depends on the physical memory at the MDS; but if metadata that was initially in the cache gets flushed out and is later accessed by some client, the MDS has to perform disk I/O to fetch the needed metadata, which is a costly operation. With the design proposed in this thesis we instead perform an extra hop to the client that holds the metadata for the file being accessed, and with low-latency, high-bandwidth interconnects like InfiniBand the cost of this extra hop is negligible compared to the expensive disk I/O. In brief, the design takes advantage of subtree partitioning and hashing-based approaches to minimize the load on the MDS and to prevent it from becoming a single point of bottleneck.
In the future, we plan to carry out studies applying this approach in MPI-IO style environments, where it will be especially beneficial. In such an environment, a single client can traverse the path and grab the extended attribute (EA) information and the striping details; this information can then be broadcast to the other processes using an MPI broadcast, so a considerable amount of time can be saved in path resolution. The number of RPCs saved is approximately (number of path elements) * (number of clients accessing the file). A sketch of this idea is given below. The scheme for distributed metadata management that we also designed is summarized in the next section.
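A minimal sketch of the envisioned optimization (the metadata buffer layout is an assumption, and the function name is illustrative):

#include <mpi.h>

/* Rank 0 resolves the path once, fetching the extended attributes and
 * striping details, then broadcasts them so the other ranks can open
 * the file without issuing per-path-component RPCs to the MDS. */
void share_open_metadata(char *md_buf, int md_len, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);
    if (rank == 0) {
        /* ... traverse the path and fill md_buf with EA/striping info ... */
    }
    MPI_Bcast(md_buf, md_len, MPI_BYTE, 0, comm);
}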
5.1.2 Design of a decentralized metadata service layer for distributed metadata management
We have designed a Distributed Metadata Service Layer and evaluated its benefits to parallel filesystems. Distributed metadata management is a hard problem, since it involves taking care of various consistency and reliability aspects; moreover, scaling metadata performance is more complex than scaling raw I/O performance, and with distributed metadata this complexity increases further. This leads to the primary goal in designing a Distributed Metadata Service Layer: to improve scalability while taking care of consistency and reliability. In order to study this topic we designed a FUSE-based filesystem, the Distributed Union FileSystem (DUFS). DUFS can combine multiple mounts of a parallel filesystem into a single virtual filesystem that is exposed to the user applications. We used ZooKeeper as a distributed coordination service to take care of metadata reliability and consistency management. Our ZooKeeper-based prototype shows the main trends that can be expected when using a distributed coordination service for metadata management. From our experiments, we can see that as the number of processes running on the client nodes, and hence the load on the client nodes, increases, the approach proposed in this thesis scales well compared to the other distributed filesystems studied, Lustre and PVFS2. While Lustre performs very well for small numbers of clients, its performance drops when the number of clients increases. With our approach, we are able to maintain good performance even with a large number of clients: with 256 client processes, we outperform Lustre for the six metadata operations, namely directory creation, directory removal, directory stat, file creation, file removal and file stat.
One major drawback of our approach is its memory usage, because the ZooKeeper servers keep all their data in memory; future work will focus on addressing this issue. Additionally, we plan to replace our MD5-based mapping function with one based on consistent hashing [26]. This approach will allow back-end storage to be added and removed dynamically while ensuring that the amount of data to relocate is bounded.
Bibliography
[1] Clustered MetaData. http://wiki.lustre.org/index.php/ClusteredMetadata.
[2] IOZONE Filesystem benchmark. http://www.iozone.org/.
[3] Isilon Systems Inc. http://www.isilon.com.
[4] Oracle Lustre File System. http://wiki.lustre.org/index.php/MainPage.
[5] Postmark File System Benchmark. http://shub-internet.org/brad/FreeBSD/postmark.html.

[6] Amina Saify, Garima Kochhar, Jenwei Hsieh, and Onur Celebioglu. Enhancing High-Performance Clusters with Parallel File Systems.
[7] Peter Braam and Michael Callahan. The InterMezzo file system, 1999.
[8] Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long, and Lan Xue. Efficient metadata management in large distributed storage systems. In Proceedings of the 20th IEEE / 11th NASA Goddard Conference on Mass Storage Systems and Technologies (MSS '03), Washington, DC, USA. IEEE Computer Society.
[9] Mike Burrows. The Chubby lock service for loosely-coupled distributed systems. In OSDI '06, Berkeley, CA, USA. USENIX Association.
[10] Philip H. Carns, Walter B. Ligon, III, Robert B. Ross, and Rajeev Thakur. PVFS: A parallel file system for Linux clusters. In Proceedings of the 4th Annual Linux Showcase and Conference. MIT Press, 2000.

[11] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. Bigtable: A distributed storage system for structured data.

[12] G. Goodson, B. Welch, B. Halevy, D. Black, and A. Adamson. NFSv4 pNFS extensions. Technical report.

[13] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google file system. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, SOSP '03, New York, NY, USA, 2003. ACM.

[14] Patrick Hunt, Mahadev Konar, Flavio P. Junqueira, and Benjamin Reed. ZooKeeper: Wait-free coordination for internet-scale systems. In Proceedings of the 2010 USENIX Annual Technical Conference, USENIX ATC'10, Berkeley, CA, USA, 2010. USENIX Association.
[15] James H. Morris, Mahadev Satyanarayanan, Michael H. Conner, John H.
Howard, David S. Rosenthal, and F. Donelson Smith. Andrew: a distributed
personal computing environment. Commun. ACM.
[16] Swapnil V. Patil, Garth A. Gibson, Sam Lang, and Milo Polte. GIGA+: Scalable directories for shared file systems. In Proceedings of the 2nd International Workshop on Petascale Data Storage (PDSW '07), held in conjunction with Supercomputing '07, New York, NY, USA, 2007. ACM.

[17] Brian Pawlowski, Chet Juszczak, Peter Staubach, Carl Smith, Diane Lebel, and David Hitz. NFS version 3: Design and implementation. In Proceedings of the Summer USENIX Conference, pages 137-152, 1994.

[18] David Quigley, Josef Sipek, Charles P. Wright, and Erez Zadok. Unionfs: User- and community-oriented development of a unification filesystem. In Proceedings of the 2006 Linux Symposium, 2006.
[19] Ronald L. Rivest. The MD5 message-digest algorithm. Internet RFC 1321, 1992.
[20] Drew Roselli, Jacob R. Lorch, and Thomas E. Anderson. A comparison of file
system workloads. In Proceedings of the annual conference on USENIX Annual
Technical Conference, ATEC ’00, Berkeley, CA, USA, 2000. USENIX Association.
[21] Mahadev Satyanarayanan, James J. Kistler, Puneet Kumar, Maria E. Okasaki, Ellen H. Siegel, and David C. Steere. Coda: A highly available file system for a distributed workstation environment. IEEE Transactions on Computers, 39:447-459, 1990.
[22] Frank Schmuck and Roger Haskin. GPFS: A shared-disk file system for large computing clusters. In Proceedings of the 1st USENIX Conference on File and Storage Technologies, FAST '02. USENIX Association.
[23] Sage A. Weil, Kristal T. Pollack, Scott A. Brandt, and Ethan L. Miller. Dynamic
metadata management for petabyte-scale file systems. In Proceedings of the 2004
ACM/IEEE conference on Supercomputing, SC ’04, Washington, DC, USA, 2004.
IEEE Computer Society.
[24] Gongye Zhou, Qiuju Lan, and Jincai Chen. A dynamic metadata equipotent
subtree partition policy for mass storage system. In Proceedings of the 2007
Japan-China Joint Workshop on Frontier of Computer Science and Technology,
FCST ’07, Washington, DC, USA. IEEE Computer Society.
[25] Yifeng Zhu, Hong Jiang, and J. Wang. Hierarchical Bloom filter arrays (HBA): A novel, scalable metadata management system for large cluster-based storage. In Proceedings of the 2004 IEEE International Conference on Cluster Computing, Washington, DC, USA. IEEE Computer Society.