2014 IEEE International Congress on Big Data
The Babel File System
Moisés Quezada-Naquid 1, Ricardo Marcelín-Jiménez 1,3, José Luis González-Compeán 2
1 Department of Electrical Eng., UAM-Iztapalapa, México D.F., MEXICO — [email protected], [email protected]
2 Dept. of Systems and Computation, Cd. Valles Institute of Technology, Cd. Valles, San Luis Potosí, MEXICO — [email protected]
3 Collaborates with the research team at the Centro Público de Información y Documentación para la Industria (INFOTEC)
Abstract— The Babel File System is a dependable, scalable and flexible storage system. Among its main features we underline the availability of different types of data redundancy, a careful decoupling between data and metadata, a middleware that enforces metadata consistency, and its own load-balance and allocation procedure, which adapts to the number and capacities of the supporting storage devices. It can be deployed over different hardware platforms, including commodity hardware. Our proposal has been designed to allow developers to settle a trade-off between price and performance, depending on their particular applications.

Keywords: BFS, massive cluster-storage, dependability, scalability, flexibility, parallelism, commodity hardware.
I. INTRODUCTION
The virtualization of fault-tolerant storage systems has
become a common solution that provides both data
availability and reliability in many applications. This is the
case of web storage services such as GFS [1], Amazon S3
[2], Nirvanix CloudNAS [3], or Microsoft SkyDrive [4]
(among others), which are successful businesses offering permanent file availability. Nevertheless, it has been observed that the cost of external web storage is several times higher than the value of the involved storage components [5, 6].
As information is becoming a very important asset, organizations have started considering the possibility of building their own large-scale storage capacities as a promising alternative to keep control of this valuable resource. This
decision is fostering studies that aim to develop new storage
systems based on a rather flexible set of components as well
as the guidelines for configuring these environments in a
cost-effective manner [7, 8].
Over the last few years we have been working on the construction of a storage solution called the Babel File System (after "The Library of Babel", a short story by Jorge Luis Borges included in "The Garden of Forking Paths" [9]), supporting features that can be found in similar systems such as those mentioned above. A storage cell, as we call an instance of Babel, is made up of a set of devices connected by means of a local area network. Users perceive only a single virtual device which supports file upload and download operations. Behind this interface there
is a redundant set of servers, also called proxies, coordinating a vast number of storage devices, whose joined capacities can scale up to the order of petabytes. We have built a dependable system that is oriented to support many different applications.
It is important to notice that our proposal is able to integrate different operating systems and storage technologies into a unified framework. Users are not, and do not need to be, aware of the underlying technologies that support their services. This approach is known as "object-based storage systems" [10, 11].
In this paper we describe the design process that we followed and the lessons that we have learned during the development of the Babel File System. Our aim is to offer a competitive and cost-effective alternative for in-house large scale storage. To date, we have deployed two services based on our platform: a WebDAV server and a biomedical image storage service (PACS: Picture Archiving and Communication System).
The rest of this paper includes the following parts: in section II, we present some of the most important storage systems that have influenced our work and review the lessons we learned from these leading designs. In section III, we describe the functional and non-functional requirements that guided our work. In section IV we introduce the most important operational entities that came out of the analysis stage. In section V, we shortly introduce two applications that we developed, based on the storage capacities supported by Babel. Finally, in section VI, we summarize our design decisions and present the current ongoing work that Babel has triggered.
II. RELATED WORK
In this section we will present those systems that have
inspired our work. It is important to point out that this is by no means an exhaustive survey. Instead, we consider that the following is a very representative list, including the most important milestones in this trend.
Generally speaking, the systems that we review advocate
the construction of a virtual storage device based on a set of
storage components where redundant information is stored.
Under this approach, the failure of an individual component
does not immediately cancel the availability of a given file,
since the remaining components do have sufficient information in case it is required to rebuild the file. Storage devices are "hidden" to the final user, or at least, they are not the first component to contact.
• CEPH is a distributed storage system initially developed at the University of California, Santa Cruz [12]. It is oriented to support massive volumes of scientific data. Its design follows two principles. First, there is no lookup table recording the place where a file has been allocated; instead, this place is calculated by means of a pseudo-random function. Second, when a user contacts the system in order to retrieve a previously stored file, she can be assigned to any available server, which means that servers do not control a fixed set of storage nodes.
• GFS was designed by Google Inc. in order to support
the massive storage of information generated or
collected by Google itself [1]. Among the principles that
we highlight from this system, we find that servers are
in charge of control and monitoring tasks: they detect failures, trigger repair procedures and tune system performance. Balance is enforced using a chunk size, which defines the maximum length of an information unit to be allocated. If a file exceeds this parameter, it is split into as many chunks as necessary to guarantee that each resulting chunk is smaller than or equal to this maximum size.
• The HDFS is a distributed, scalable, and portable file
system written in Java for the Hadoop framework [13].
Each node in a Hadoop instance typically hosts a single data node, and a cluster of data nodes forms the HDFS cluster. Each data node serves blocks of data over the network using a block protocol specific to HDFS. The file system uses the TCP/IP layer for communication, and clients use RPC to communicate with the name node and the data nodes. HDFS stores large files (an ideal file size is a multiple of 64 MB) across multiple machines. It achieves reliability by replicating the data across multiple hosts, and hence does not require RAID storage on the hosts. With the default replication value, 3, data is stored on three nodes. Data nodes can talk to each other to rebalance data, to move copies around, and to keep the replication above a given service level agreement. HDFS requires one unique server, the name node, which is a single point of failure for an HDFS installation.
• The Lustre file system architecture was developed as a
research project at Carnegie Mellon University [14]. A
Lustre file system has the following major functional units: a single metadata server with a single metadata target (MDT) per Lustre filesystem, where namespace metadata is kept; one or more object storage servers (OSS) that store file data on one or more object storage targets (OST); and the clients.
• GlusterFS is based on a stackable user-space design without compromising performance [15]. It has found a variety of applications including cloud computing, biomedical sciences and archival storage. GlusterFS has a client and a server component. Servers are typically deployed as storage bricks, with each server running a glusterfsd daemon to export a local file system as a volume. The GlusterFS client process composes virtual volumes from multiple remote servers using stackable translators. By default, a file is stored as a single information unit, but striping of files across multiple remote volumes is also supported. The GlusterFS server is kept minimally simple: it exports an existing file system as-is, leaving it up to client-side translators to structure the store. The clients themselves are stateless, do not communicate with each other, and are expected to have translator configurations consistent with each other. GlusterFS relies on an elastic hashing algorithm, rather than using either a centralized or distributed metadata model. With version 3.1 and later of GlusterFS, volumes can be added, deleted, or migrated dynamically, helping to avoid coherency problems, and allowing GlusterFS to scale up on commodity hardware by avoiding bottlenecks that normally affect more tightly-coupled distributed file systems.
Among the lessons that we have learned [16, 17, 18, 19] we underline the following principles:
1) For systems built from a large number of parts, failures must be regarded as a common issue. Under this assumption, the design process must consider different types of redundant resources that bail the system out of transient failures. Also, systems must include monitoring or supervision capabilities, as well as self-repair functions.
2) Massive storage systems have an interface that offers a
unified entry point. Behind this interface, there is a set
of servers that filter each incoming request and manage
individual storage devices, where redundant
information is finally allocated.
3) Most of the individual storage devices also include processing capabilities. Therefore, those processing functions that are part of the storage service can be entrusted to these very devices. With this decision, a high degree of parallelism can be achieved, which has a potential benefit on service times and availability. This decision implies the definition of two key parameters: a maximum-length processing unit (also called chunk or fragment size), and a maximum-length storage unit (also called block size). In turn, parallelism calls for a very resourceful design, including load-balance procedures for either processing or storage units.
4) Under very demanding conditions, designers must consider the possibility of using high-speed temporary devices that work as buffers between the application clients and the final storage device.
5) In order to address reliability concerns, it appears convenient to decouple data from metadata. Data refers to the storage units that encode the users' information, while metadata refers to the information the system must keep in order to retrieve that data. The former is allocated on the final storage devices, while the latter is kept on the system servers. Notice that a redundant set of servers implies a potential inconsistency among the replicated information that each server records. Also notice that, as the load-balance procedure is invoked, metadata may become out of date.
III. SYSTEM OVERVIEW AND DESIGN CONSIDERATIONS
In this section we present the functional and non-functional
requirements that guided our design [20, 21, 22]. Functional
requirements are explained in terms of their corresponding
use cases. As for non-functional requirements, we present a
list of the quality issues that have been taken into account. At
the same time, we introduce, on the fly, the key processing
units that translate these use cases into concrete operations
and also describe the way these units impact on the system’s
quality and performance.
It is important to notice that, in a highly dependable
system such as Babel, functional and non-functional
requirements are tightly intertwined, as the reader has
probably realized.
Use cases can be divided into two families: the first one supports users' account management and includes account creation, cancellation, connection and disconnection. The second one involves resource management and includes file upload, download, delete, disk replacement, scaling and balance. For the sake of brevity we will now focus on the upload case, since we consider that it involves most of the parameters and processing units that take part in each of the remaining cases.
Figure 1. Babel upload sequence.
Let us suppose that a given user, say Mary, wants to upload a file. As we describe the steps required to fulfill her request (see Figure 1), we will introduce the components of the system.
1. Mary contacts the cell interface.
2. An appointed server, or proxy, validates her as an authorized user.
3. As Mary submits her file, the proxy creates a stream between her computer and a given logical volume, also called storage node. This node is selected by the proxy according to a random selection procedure. The proxy also records this operation on the current log and creates a new entry in its database (metadata), in order to support the future retrieval.
4. Now, the storage node starts receiving the file stream. To foster processing balance, the system defines a parameter called maximum storage unit (MSU). Files are split into as many processing units, or fragments, as necessary to guarantee that the length of each fragment does not exceed the given MSU (a small sketch of this step follows the list).
5. Each storage node maintains a revolving or cyclic list, called its ring, with the identities of the storage nodes that collaborate with it. According to the order in which they appear in the ring, the node in charge allocates each of the resulting fragments to its collaborators, starting from the last appointed place.
6. To achieve fault-tolerance and high availability, fragments undergo a redundancy generation procedure. The system supports two different procedures: either simple replication or the information dispersal algorithm (IDA) [23]. Depending on a service level agreement, which is related to Mary's profile, the node that has received a fragment selects the corresponding procedure. Replication creates two identical copies of the fragment, called blocks. Instead, IDA creates a set of k different storage units, also called blocks, such that the original fragment can be recovered provided that any m of these blocks are available. Whether the node uses replication or IDA, the resulting blocks are accommodated by invoking a local function called the oracle. This is a deterministic function that calculates a storage node id, where the block will be finally allocated. Each block has a fixed set of properties that we refer to as its signature. An oracle maps each block signature into a storage node id. Notice that it is possible to retrieve a block, provided that its corresponding signature is known and an instance of the oracle is available, regardless of the place where the oracle is invoked.
7. A node which is required to process or store an information unit (file, fragment, or block) confirms the completion of this task to the immediate source that requested its capacity.
8. A proxy keeps a replica of its metadata on each of the proxies that make up the cell interface. In a similar way, a storage node replicates its metadata on a couple of nodes that belong to its ring. In any case, metadata consistency is an important aspect to care about.
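To make steps 4 and 5 concrete, the following sketch splits an incoming byte stream into fragments bounded by the MSU and hands them to collaborating nodes in ring order. It is only an illustration of the idea: the MSU value, the node names and the helper functions are ours, not the actual Babel code.

```python
from typing import List, Tuple

MSU = 4 * 1024 * 1024  # illustrative maximum storage unit: 4 MiB


def split_into_fragments(data: bytes, msu: int = MSU) -> List[bytes]:
    """Split a file into fragments whose length never exceeds the MSU."""
    return [data[i:i + msu] for i in range(0, len(data), msu)]


def assign_on_ring(fragments: List[bytes], ring: List[str],
                   last_used: int) -> Tuple[List[Tuple[str, bytes]], int]:
    """Hand fragments to collaborating nodes in ring (cyclic) order,
    starting right after the last node appointed previously."""
    assignments = []
    pos = last_used
    for frag in fragments:
        pos = (pos + 1) % len(ring)
        assignments.append((ring[pos], frag))
    return assignments, pos


if __name__ == "__main__":
    ring = ["node-01", "node-02", "node-03"]          # illustrative ring
    fragments = split_into_fragments(b"x" * (10 * 1024 * 1024))
    placed, last = assign_on_ring(fragments, ring, last_used=-1)
    for node, frag in placed:
        print(node, len(frag))
```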
Three basic considerations must be taken into account to build a dependable storage system: availability, fault-tolerance and scalability. In order to fulfill these goals, a thorough design should consider information redundancy techniques and data placement strategies. The former refer to
the methods that produce redundant information to support
availability and fault-tolerance. The latter refer to the
(re)allocation of data within the available storage devices, to
support changes on the initial configuration.
Redundant information, as we already presented, can be obtained from simple replication, which generates k copies, called storage units or blocks, of each original fragment (or processing unit). This is the principle followed by GFS [1] and HDFS [13]. Alternatively, redundancy can also be produced using error-correcting techniques, for instance the information dispersal algorithm (IDA), which is a linear transformation over a finite field.
In turn, a data placement strategy, or oracle, determines the storage device where a given block has to be allocated [19, 23]. This function should be computed efficiently, and it should take into account the dynamics of the overall system, including new devices that come into operation, and old devices that experience temporary or permanent failures and are eventually replaced.
The non-functional requirements that we have considered are: cost, reliability/fault-tolerance, scalability, availability, modularity, portability and interface.
• We met a cost-effective building solution that can be assembled even from commodity components.
• As we mentioned before, the key to reliability/fault-tolerance is the utilization of redundant resources, i.e. redundant devices, redundant information and redundant processing capacities.
• Scalability has been accomplished by designing an adaptive oracle that enforces storage balance and adapts its work to a growing number of storage devices.
• Availability, as we understand it, is achieved using parallelism either to process information, or to store and retrieve redundant blocks.
• Modularity is achieved by articulating a loosely-coupled set of modules (and plug-ins) that can be modified independently from each other.
• Portability refers to the possibility of deploying the system over different hardware platforms. We understood the necessity of a multiplatform programming language and chose Python as the fundamental building tool.
• Finally, we devised a small and simple command line interface that can be easily extended to match different client applications, supported by a GUI or a Web interface.

IV. THE KEYS TO BABEL
Besides the software design process itself and the resulting blueprints, we consider that the major contributions of our project are the following operational units: i) a parameterized IDA module that can be easily adapted to different conditions, ii) our own oracle design and its corresponding module implementation, and iii) an implementation of the Paxos protocol [24, 25], enforcing metadata consistency.
A. The information dispersal algorithm
The IDA [26, 27] achieves fault tolerance by means of information redundancy. Let F be a fragment; F is transformed into k files, called dispersals, each of size |F|/m, where k > m > 1. Then, the dispersals are handed over to a set of disks, i.e. each disk stores one of the k dispersals of F. From the algorithm properties it is guaranteed that, even if up to k - m dispersals are lost, the original information can be reconstructed from any m surviving dispersals. Due to these properties, not only can a fragment F be recovered from any m of its blocks, but any missing block can also be recovered provided that m blocks remain available. In this case, the cost of the reconstruction is acceptable in view of the increased usable storage capacity and the superior fault-tolerance.
A very important matter that we addressed is the quest for what we could call a "good" combination of IDA parameters (k, m). Let us recall that each dispersal is 1/m the size of F and, therefore, there is an excess of information, or information redundancy, equal to (k-m)/m times the size of F. In our implementation we consider that the combination (5,3) is a good trade-off between fault-tolerance and redundancy: the excess amounts to two thirds of the size of F, while any two dispersals can be lost without compromising the fragment.
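As an illustration of the (k, m) dispersal and reconstruction just described, the following sketch implements a Rabin-style IDA over the prime field GF(257), using a Vandermonde encoding matrix. It is a minimal sketch for exposition only; Babel's parameterized IDA module may use a different field, matrix and padding scheme.

```python
# Minimal sketch of Rabin-style information dispersal over GF(257).
P = 257  # prime field size; every byte value 0..255 is a valid symbol


def _vandermonde_row(x: int, m: int) -> list:
    return [pow(x, j, P) for j in range(m)]


def disperse(data: bytes, k: int, m: int) -> list:
    """Split data into k dispersals; any m of them reconstruct the original."""
    padded = data + bytes((-len(data)) % m)
    rows = [_vandermonde_row(x, m) for x in range(1, k + 1)]
    dispersals = [[] for _ in range(k)]
    for i in range(0, len(padded), m):
        group = padded[i:i + m]
        for d, row in enumerate(rows):
            dispersals[d].append(sum(r * g for r, g in zip(row, group)) % P)
    return list(zip(range(1, k + 1), dispersals))


def _solve(mat, vec):
    """Solve mat * sol = vec over GF(P) by Gauss-Jordan elimination."""
    m = len(vec)
    aug = [list(mat[r]) + [vec[r]] for r in range(m)]
    for col in range(m):
        piv = next(r for r in range(col, m) if aug[r][col])
        aug[col], aug[piv] = aug[piv], aug[col]
        inv = pow(aug[col][col], P - 2, P)
        aug[col] = [(v * inv) % P for v in aug[col]]
        for r in range(m):
            if r != col and aug[r][col]:
                factor = aug[r][col]
                aug[r] = [(a - factor * b) % P for a, b in zip(aug[r], aug[col])]
    return [aug[r][m] for r in range(m)]


def reconstruct(pieces: list, m: int, original_len: int) -> bytes:
    """Rebuild the fragment from any m surviving dispersals (x, symbols)."""
    pieces = pieces[:m]
    mat = [_vandermonde_row(x, m) for x, _ in pieces]
    out = bytearray()
    for i in range(len(pieces[0][1])):
        out.extend(_solve(mat, [symbols[i] for _, symbols in pieces]))
    return bytes(out[:original_len])


if __name__ == "__main__":
    fragment = b"The Library of Babel"
    dispersals = disperse(fragment, k=5, m=3)
    survivors = [dispersals[0], dispersals[2], dispersals[4]]  # lose any two
    assert reconstruct(survivors, m=3, original_len=len(fragment)) == fragment
    print("redundancy overhead (k-m)/m:", (5 - 3) / 3)
```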
The system performs two different types of operations: i)
user operations (i.e., file storage, retrieval and erasing)
which only involve active components and ii) system
operations including account management and the recovery
procedure. The latter is triggered when an active component
crashes and a spare component is selected to replace the
faulty one.
When the recovery procedure starts, a spare component is
appointed in order to replace the one that failed. As part of
this process all pertinent data from other components has to
be retrieved in order to fully reconstruct the information
previously available in the missing device.
We developed a performance study to evaluate the mean
time to failure of our system and the impact of different
factors on this metric of performance [28]. For this goal we
built a discrete event simulator using OMNeT++ [29]. We
considered the following assumptions: i) the system is
working below its maximum capacity, ii) the time to recover
a dispersal is linearly dependent on the size of the missing
dispersal, iii) any user operation is interrupted during
recovery, iv) the time to failure of all storage components is
modeled by independent and identically distributed negative
exponential random variables, v) repair times are also
represented by independent and identically distributed
negative exponential random variables.
For k = 5, our experiment design included 4 parameters:
a) repair time, b) MTTF of individual storage components, c)
initial number of spare components and, d) initial number of
active components.
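The study itself was carried out with an OMNeT++ model [29]; the following Monte Carlo sketch only illustrates the flavor of such an experiment under assumptions i)-v). It deliberately simplifies the model (for instance, the spare pool is assumed never to run out, and data loss is declared as soon as more than k - m components are down simultaneously), and all parameter values and results are illustrative, not those of the paper.

```python
# Monte Carlo sketch: v active components with i.i.d. exponential times to
# failure and to repair; the cell loses data once more than (k - m)
# components are down at the same time.
import heapq
import random


def time_to_data_loss(v, k, m, mttf_hours, repair_hours, rng):
    events = []  # min-heap of (time, kind, component)
    for c in range(v):
        heapq.heappush(events, (rng.expovariate(1.0 / mttf_hours), "fail", c))
    down = 0
    while True:
        clock, kind, c = heapq.heappop(events)
        if kind == "fail":
            down += 1
            if down > k - m:
                return clock                     # data loss instant
            heapq.heappush(events,
                           (clock + rng.expovariate(1.0 / repair_hours), "repair", c))
        else:                                    # repair finished
            down -= 1
            heapq.heappush(events,
                           (clock + rng.expovariate(1.0 / mttf_hours), "fail", c))


if __name__ == "__main__":
    rng = random.Random(42)
    runs = [time_to_data_loss(v=5, k=5, m=3,
                              mttf_hours=20_000, repair_hours=5, rng=rng)
            for _ in range(200)]
    print("estimated mean time to data loss (years):",
          round(sum(runs) / len(runs) / (24 * 365), 1))
```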
Figure 2 shows the longest MTTF obtained. The solid line is the corresponding experimental histogram, which we compare to a fitted negative exponential pdf, shown as a dotted line. Both have a mean time equal to 4,981.60 years. This case corresponds to the following combination of parameters: mean repairing time equal to 5 hrs, mean individual failure time equal to 20,000 hrs, spare components s = 3 and active components v = 5.
In contrast, Figure 3 shows the shortest MTTF obtained. Again, we compare the experimental distribution (solid) to the fitted pdf (dotted). This time, both have a mean time equal to 8.85 years. The corresponding parameters are: mean repairing time equal to 20 hrs, mean individual failure time equal to 5,000 hrs, spare components s = 1, and active components v = 7.

Fig. 2. Longest MTTF of the System
Fig. 3. Shortest MTTF of the System

B. The oracle
An oracle is required to answer a very simple question: in which of the many devices that make up the system should a given block be allocated? Notice that this question is issued at different moments during a block's lifetime: i) when it is uploaded, and ii) every time it is retrieved. Nevertheless, as we mentioned before, the system experiences changes on its initial composition, and it is quite possible that the device where a given block was initially allocated is not the permanent place from which it is going to be retrieved. Indeed, there is a third condition to invoke the oracle: iii) when a new storage device has been incorporated, either to replace a faulty one or to scale up the overall system capacity. In this last circumstance, the oracle helps the system to migrate blocks in order to recover its load balance.
It is quite important to notice that, although a given block may change its allocation during its lifetime, there exists a set of attributes, which we call the signature of the block, that characterize this unit and never change, such as its name or its date of creation. Therefore, as we have already stated, the oracle receives the signature of a block in order to calculate its current location. The oracle adapts its function to the actual number of storage devices, but the metadata required for storage and retrieval remains unchanged. This principle eases the burden of metadata management.
Let us suppose that Mary wants to store a file "report.txt". Considering its size and the definition of the parameter MSU, the file is split into two fragments: "USRMary:report.txt.f1" and "USRMary:report.txt.f2". Now let us assume that each fragment is processed using an instance of IDA with parameters (5,3). If we want to know the position, i.e. the device where the 3rd block (dispersal) of the 2nd fragment should be allocated, we submit its signature "USRMary:report.txt.f2.b3" to the oracle, which returns the identity of the final device.
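Because the signature is derived from metadata alone, every block signature of a stored file can be regenerated on demand and handed to the oracle. The sketch below does exactly that for Mary's example; the naming convention follows the signatures shown above, while the helper name and the assumption that replication yields two block copies and IDA(k, m) yields k blocks (as in the upload description) are ours.

```python
# Rebuild every block signature of a file from metadata alone.
import math


def block_signatures(user: str, filename: str, size: int, msu: int,
                     scheme: int, k: int = 5) -> list:
    """scheme 0: replication (two block copies); scheme 1: IDA with k blocks."""
    fragments = max(1, math.ceil(size / msu))
    blocks_per_fragment = 2 if scheme == 0 else k
    return [f"USR{user}:{filename}.f{f}.b{b}"
            for f in range(1, fragments + 1)
            for b in range(1, blocks_per_fragment + 1)]


if __name__ == "__main__":
    for sig in block_signatures("Mary", "report.txt",
                                size=6_000_000, msu=4_000_000, scheme=1):
        print(sig)   # each signature is what gets submitted to the oracle
```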
It is a common practice to study the object allocation problem using a "bins and balls" model borrowed from probability theory [30]. Storage devices are regarded as bins, and redundant data units, or objects, as balls. We call the set of balls (blocks) having a common source (fragment) a redundancy group. A collision happens when two or more balls belonging to the same redundancy group are allocated to the same bin.
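To illustrate why a purely random placement can only bound the collision probability, the following sketch estimates how often a redundancy group of r balls, thrown independently and uniformly into n bins, puts two balls in the same bin. The numbers are illustrative only and are not taken from our experiments.

```python
# Estimate the collision probability of a redundancy group under purely
# random placement of r balls into n bins.
import random


def collision_probability(r: int, n: int, trials: int = 100_000,
                          seed: int = 1) -> float:
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        bins = [rng.randrange(n) for _ in range(r)]
        if len(set(bins)) < r:      # at least two balls share a bin
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for n in (16, 64, 256):
        print(n, round(collision_probability(r=5, n=n), 4))
```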
We recognize two sources that have inspired the design of our oracle: RUSH [23] and RS [19]. In RUSH (Replication Under Scalable Hashing), bins are grouped to form subclusters. Each time the system requires to scale up its overall capacity, a new subcluster is attached. Therefore, to find the bin where a given ball is accommodated, RUSH proceeds in two steps: first, it identifies the subcluster where the ball is placed and, second, it appoints a bin within the given subcluster. Balls are mapped to bins using prime-number arithmetic that guarantees that no two of them are allocated to the same bin. In turn, RS (Random Slicing) uses a hash function to map each object to a point in the [0.0, 1.0) interval. At the same time, the working interval is partitioned into smaller, non-overlapping intervals assigned to the bins currently in use, according to their relative capacities. Each time a new bin is incorporated, the interval is remapped over the extended set of bins.
RUSH and RS excel at providing balance and time efficiency. The problem with RUSH is that it is necessary to keep track of the prime number and the sub-cluster size supporting each allocation. This fact may have a major impact on metadata management when dealing with massive-scale storage capacities. In contrast, the downside of RS is that it only provides an upper bound on the probability of collision, which is inversely proportional to the number of available bins.
Our solution is based on a combination of the principles
of RUSH and RS. Initially, the overall storage capacity is
evenly divided into v subsets of bins, P0, …, Pv-1, also called
pools. Each pool has an initial capacity b0, and is managed
as an individual instance of RS. If we assume a cyclic or
revolving order on the pools IDs, we can say that pool P0 is
the successor of pool Pv-1.
Each time the system is about to reach its overall
capacity, a new generation of v bins has to be attached, on
two conditions: i) all the bins of the same generation have
the same capacity and ii) each pool receives a new bin. The
procedure to accommodate the relative capacities of the bins
belonging to the same pool will be exactly the same as in RS.
This means that the intervals preserve their length despite the increase in the associated pool storage capacity. Notice that, due to the initial settlement, properties i and ii produce v identical pools, regardless of the number of generations of bins that have been attached to scale up the system.
Let us assume that a redundancy group of r ≤ v objects, O0, …, Or-1, has to be accommodated on a system of v pools. We map object O0 to a given pool Pi (using a pseudo-random function). It is known that, for any number p which is relatively prime to v, we can build a permutation of the set {0, …, v-1}, starting from i, to appoint the successive pools where the remaining objects should be allocated. In other words, object Oj is assigned to pool P((i + j·p) mod v), for j = 0, …, r-1. This simple mechanism guarantees that collisions are impossible. Also, notice that if p = 1, we allocate the group on r successive pools, starting from Pi. Finally, an appointed pool stores the ball it receives as RS would proceed, which means that it maps the ball to a number in the working interval [0.0, 1.0) and then finds the bin behind that number. Since we have v identical pools, this final calculation has to be performed only once, independently of the group size.
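The following sketch puts the previous two paragraphs together into a small oracle: the group is spread over distinct pools by the permutation (i + j·p) mod v, and the bin inside a pool is found with a Random Slicing lookup over the [0.0, 1.0) interval. It follows one possible reading of the description, in which the interval lookup is computed once per group and reused across the identical pools; the hash function, class names and capacities are illustrative, not the actual Babel module.

```python
import bisect
import hashlib


def _unit_interval(text: str) -> float:
    """Deterministically map a string to a point in [0.0, 1.0)."""
    digest = hashlib.sha1(text.encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2 ** 64


class Pool:
    """One Random Slicing instance: the [0, 1) interval is split among the
    bins of the pool in proportion to their capacities."""
    def __init__(self, capacities):
        total = float(sum(capacities))
        self.boundaries, acc = [], 0.0
        for cap in capacities:
            acc += cap / total
            self.boundaries.append(acc)

    def bin_for(self, point: float) -> int:
        return min(bisect.bisect_right(self.boundaries, point),
                   len(self.boundaries) - 1)


class RSPoolsOracle:
    """v identical pools; a redundancy group is spread over distinct pools
    with the permutation (i + j*p) mod v, p relatively prime to v."""
    def __init__(self, v: int, capacities, p: int = 1):
        self.v, self.p = v, p
        self.pool = Pool(capacities)   # all pools share the same layout

    def locate_group(self, group_key: str, r: int):
        """Return (pool id, bin id) for each of the r blocks of one group."""
        first_pool = int(_unit_interval(group_key) * self.v)   # pseudo-random
        bin_id = self.pool.bin_for(_unit_interval(group_key))  # computed once
        return [((first_pool + j * self.p) % self.v, bin_id) for j in range(r)]


if __name__ == "__main__":
    # three bin generations per pool, capacities growing by a factor of 1.5
    oracle = RSPoolsOracle(v=5, capacities=[1.0, 1.5, 2.25], p=1)
    for device in oracle.locate_group("USRMary:report.txt.f2", r=5):
        print(device)   # five IDA blocks, five distinct pools, same bin slot
```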
We developed a study to investigate the overall load
migration, whenever a new generation of bins has to be
attached. We assume that RS-Pools starts with v={5,6,7}
pools, and each pool has an initial bin with capacity b0=1
TiB. When the k-th generation of v new bins is introduced,
each bin has a capacity bk = 1.5 bk-1. We also assume that
each redundancy group is made up of R={3,4,5} balls, each of size 1 MiB. To compare with RUSHp
under similar circumstances, we consider that, each time the
system is about to scale up, a subcluster with v new bins is
attached. Therefore, the overall capacity on each new bin
generation is exactly the same for either RS-Pools, or
RUSHp.
Results presented in Figures 4 and 5 show the overall load migration after a new bin generation is attached and the system recovers its balance. For either RS-Pools or RUSHp, the system settles down to its long-term level when the 6th bin generation is introduced. In the case of RS-Pools, it asymptotically reaches a limit of 0.33. Also, we observe that this behavior is independent of the number of pools and of the number of redundant balls, i.e. the redundancy group size. Meanwhile, RUSHp stabilizes above 0.73; it is very sensitive to the group size and, in the long term, slightly sensitive to the total number of active bins.
Fig. 4. RS-Pools reallocation rate.
Fig. 5. RUSHp reallocation rate.
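A quick sanity check of the 0.33 asymptote: if an ideal placement migrated only the data needed to fill the newly attached bins, the migrated fraction at generation g would be the new capacity divided by the total capacity, which tends to 1 - 1/1.5 = 1/3 when capacities grow by a factor of 1.5. The following lines compute that idealized fraction; they are a back-of-the-envelope check, not a reproduction of the experiments.

```python
# Idealized migration fraction when the g-th bin generation (capacity 1.5x
# the previous one) is attached: new capacity / total capacity.
def migrated_fraction(generation: int, growth: float = 1.5) -> float:
    capacities = [growth ** g for g in range(generation + 1)]
    return capacities[-1] / sum(capacities)


for g in range(1, 9):
    print(g, round(migrated_fraction(g), 3))
# the fraction approaches 1 - 1/1.5 = 0.333..., the limit observed for RS-Pools
```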
C. Paxos and metadata consistency
What is the metadata that must be recorded at the proxy in order to support the recovery of a previously stored file?
In the current state of the system it is enough to record the name of the user (Mary), the name of the given file (report.txt), its size, the definition of the parameter MSU (which is a single value applied to every operation), and the information redundancy technique applied to produce the final blocks (0: replication, 1: IDA), which is linked to the user's profile. Notice that it is not necessary to record either the number of blocks produced from the initial source file or the devices where they were allocated: this information can be calculated. The signature of each block is simply rebuilt and submitted to an instance of the oracle. This principle is what we call the decoupling between data and metadata.
Let us assume now that the proxy faces one of the following conditions: it receives a large number of requests within a very short window of time, or it suddenly interrupts its operations and is considered out of service. In order to tolerate either of these conditions, we should deploy a redundant set of proxies and, also, a replicated and consistent metadata file on each of them. It is known that the Paxos protocol offers a very efficient procedure to build a consistent record of a replicated database. The approach followed by this protocol, also called the part-time parliament protocol, consists in solving the consensus problem by means of an appointed leader and a quorum of active entities, supporting persistence of the values already decided.
We have developed an implementation of Paxos,
currently under validation. As stated in the original
proposal, the appointed leader enforces the accomplishment
of the safety conditions. Meanwhile, liveness is guaranteed
provided that there is exactly one leader. For this purpose we
have implemented a heart-beat mechanism that triggers a
new election procedure, when the current leader interrupts its
pulses. The FLP theorem [31] shows the impossibility of
consensus under an asynchronous communication model.
The heart-beat mechanism is an alternative to overcome the
limitations due to the lack of a universal clock.
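The heart-beat mechanism can be pictured as follows: the leader periodically refreshes a timestamp, and any replica that misses pulses beyond a timeout calls for a new election. The sketch below is only illustrative; the timeout value and the start_election() hook are placeholders, not our Paxos implementation.

```python
# Illustrative heart-beat failure detector: replicas watch the leader's
# pulses and call for a new election when the pulses stop.
import threading
import time


class HeartbeatMonitor:
    def __init__(self, timeout_s: float, start_election):
        self.timeout_s = timeout_s
        self.start_election = start_election
        self.last_pulse = time.monotonic()
        self._stop = threading.Event()

    def pulse(self):
        """Called whenever a heart-beat message from the leader arrives."""
        self.last_pulse = time.monotonic()

    def _watch(self):
        while not self._stop.wait(self.timeout_s / 2):
            if time.monotonic() - self.last_pulse > self.timeout_s:
                self.start_election()                 # suspected leader crash
                self.last_pulse = time.monotonic()    # avoid immediate retrigger

    def run(self):
        threading.Thread(target=self._watch, daemon=True).start()

    def stop(self):
        self._stop.set()


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout_s=1.0,
                               start_election=lambda: print("start new election"))
    monitor.run()
    time.sleep(2.5)   # no pulses arrive, so an election is triggered
    monitor.stop()
```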
The team that built GFS developed its own implementation of Paxos [24], called Chubby. They mention that there is a long road from the description that appears in the initial paper to the final programming of Chubby; this road passes through several stops, including modeling, validation and performance testing.
V. THE POSSIBILITIES OF BABEL
In this section we shortly describe two different applications that we developed based on the storage capacities supported by Babel. Once in operation, Babel can be understood as a black box having a simple command line interface that can be extended to fit the requirements of different services, such as the WebDAV server or the Picture Archiving and Communications System (PACS) that we are about to introduce.
The Web Distributed Authoring and Versioning protocol (WebDAV), defined by the IETF RFCs 2518, 3253 and 4918, is an extension of the Hypertext Transfer Protocol (HTTP) that turns a web site into a repository with reading and writing capabilities, where authors find support for file management operations such as those available in an ordinary file system, i.e. file and directory creation, changes, erasure, protection against overwriting, etc. We built an initial cloud storage service that evolved and turned into a WebDAV server on top of the Babel interface. It is quite interesting to mention that we tested this service with a couple of commercial WebDAV clients [32, 33], including BitKinex and AnyClient.
In turn, a Picture Archiving and Communications System (PACS) [34], defined by the NEMA DICOM standard, also known as the ISO 12052:2006 standard, is designed to articulate the many different devices involved in the production, display, storage, retrieval and printing of medical image files. Our implementation is based on the PixelMed toolkit [35], which is a free and open source library.
We built a PACS storage server that supports a subset of the services described in the DICOM conformance specifications [36]. Our design is based on the PixelMed toolkit, a set of free, libre and open source libraries implementing code for reading and creating data, network and file support, object database management, display of directories, images, reports and spectra, and object validation. The architecture that we propose allows high cohesion and low coupling, since it is simple to replace the communication with any database handler. On the other side, the integration of HTTP ensures portable communication with the interface of Babel.

VI. CONCLUSIONS AND FURTHER WORK
In this paper we have briefly introduced the design considerations that have led to the construction of the Babel File System. This is a large scale, highly dependable storage system, based on a rather flexible set of components. We consider that our proposal can be understood as a lego-type family of solutions that can be easily fitted to different applications. Accordingly, people in charge of storage system deployment may consider the possibility of using Babel, as they have enough leeway to settle a trade-off between price and performance, depending on their particular priorities.
We have carefully addressed three basic conditions that, from our view, provide the foundations of an effective and long-lasting system: i) reliability, ii) scalability, and iii) service times. Redundancy is the key to reliability, which grants service continuity. In turn, service continuity means that the system should be interrupted as little as possible, not only during failures, but also when the system reaches its limit and new components must be attached to scale up the overall storage capacity. Each time the system grows, it enters into a transitory condition where some of the objects (blocks) already stored have to be reallocated in order to recover load balance. A thoughtful design must consider two critical issues during this stage. First, the system should reassign as few blocks as possible. Second, metadata updating should be kept in mind. Finally, service times may also profit from the fact that storage devices have processing capacities. Therefore, as there exists an important number of such devices, parallelism (and load balance) is fundamental to achieve short service times.
To show the potential of Babel, we developed a couple of client applications that base their work on the storage capacities supported by our system. We built a storage service that provides the interface of a WebDAV server, based on the IETF RFC 4918. On the other hand, we built a PACS (Picture Archiving and Communications System), according to the NEMA DICOM standard, also known as the ISO 12052:2006 standard. Among the lessons that we learned from these accomplishments, we realized that Babel has a flexible interface that can be easily extended to fit the requirements of high-volume storage-based services, such as authoring and versioning systems, or those oriented to manage high volumes of heavy image files.
We are currently working on different directions for immediate work. Among the strategic issues that we are about to address, we consider the deployment of a parallel query platform over the set of processing nodes that make up the storage cell [37]. A second issue is the possibility of building a federation of storage cells. As this federation may grow, we observe that it would be difficult to have a centralized control. To deal with this challenge, we consider that P2P systems may provide us with interesting ideas that could be adapted to our needs. For instance, we are considering the possibility of building semantic capacities on top of a cell federation resembling a P2P network [38, 39].
REFERENCES
[1] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google file system," SIGOPS Oper. Syst. Rev., Oct. 2003, pp. 29-43, doi:10.1145/1165389.945450.
[2] M. Palankar, A. Iamnitchi, M. Ripeanu and S. Garfinkel, "Amazon S3 for science grids: a viable solution?," Proc. International Workshop on Data-Aware Distributed Computing (DADC '08), ACM, June 2008, pp. 55-64, doi:10.1145/1383519.1383526.
[3] Nirvanix CloudNAS, http://en.wikipedia.org/wiki/Nirvanix, 2014.
[4] Microsoft SkyDrive, http://skydrive.live.com, 2014.
[5] I. Ion, N. Sachdeva, P. Kumaraguru and S. Čapkun, "Home is safer than the cloud!: privacy concerns for consumer cloud storage," Proc. Seventh Symposium on Usable Privacy and Security (SOUPS '11), Article 13, 20 pages, doi:10.1145/2078827.2078845.
[6] E. Walker, W. Brisken and J. Romney, "To Lease or Not to Lease from Storage Clouds," Computer, vol. 43, no. 4, pp. 44-50, April 2010, doi:10.1109/MC.2010.115.
[7] J. L. Gonzalez and R. Marcelin-Jimenez, "Phoenix: A Fault-Tolerant Distributed Web Storage Based on URLs," IEEE 9th International Symposium on Parallel and Distributed Processing with Applications (ISPA), pp. 282-287, May 2011, doi:10.1109/ISPA.2011.33.
[8] E. Chai, M. Uehara, M. Murakami and M. Yamagiwa, "Online Web Storage Using Virtual Large-Scale Disks," International Conference on Complex, Intelligent and Software Intensive Systems (CISIS '09), pp. 512-517, March 2009, doi:10.1109/CISIS.2009.74.
[9] J. L. Borges, El jardín de senderos que se bifurcan, Editorial Sur, 1941.
[10] R. O. Weber, "Information Technology – SCSI object-based storage device commands (OSD)," Technical Council Proposal Document T10/1355-D, Technical Committee T10.
[11] R. J. Honicky and E. Miller, "A Fast Algorithm for Online Placement and Reorganization of Replicated Data," Parallel and Distributed Processing Symposium (IPDPS), 2003, doi:10.1109/IPDPS.2003.1213151.
[12] S. A. Weil, S. A. Brandt, E. L. Miller, D. D. E. Long, and C. Maltzahn, "Ceph: a scalable, high-performance distributed file system," Proc. 7th Symposium on Operating Systems Design and Implementation (OSDI '06), 2006, pp. 307-320.
[13] K. Shvachko, H. Kuang, S. Radia and R. Chansler, "The Hadoop Distributed File System," IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), pp. 1-10, May 2010, doi:10.1109/MSST.2010.5496972.
[14] P. Schwan, "Lustre: Building a file system for 1000-node clusters," Linux Symposium, 2003.
[15] Gluster Community, http://www.gluster.org, 2014.
[16] J. Lee, B. Tierney and W. Johnston, "Data Intensive Distributed Computing: A Medical Application Example," 7th International Conference on High-Performance Computing and Networking (HPCN Europe), 1999, pp. 150-158.
[17] B. L. Tierney, J. Lee, B. Crowley and M. Holding, "A Network-Aware Distributed Storage Cache for Data Intensive Environments," High Performance Distributed Computing Conference (HPDC '99), 1999, pp. 185-193.
[18] Z. Ali and Q. Malluhi, "NSM: A Distributed Storage Architecture for Data-Intensive Applications," 20th IEEE/11th NASA Goddard Conference on Mass Storage Systems and Technologies (MSS '03), 2003, p. 87.
[19] A. Miranda, S. Effert, Y. Kang, E. L. Miller, A. Brinkmann and T. Cortes, "Reliable and randomized data distribution strategies for large scale storage systems," 18th International Conference on High Performance Computing (HiPC), Dec. 2011, doi:10.1109/HiPC.2011.6152745.
[20] K. E. Wiegers, Software Requirements 2: Practical techniques for gathering and managing requirements throughout the product development cycle, 2nd ed., Redmond: Microsoft Press, 2003, ISBN 0-7356-1879-8.
[21] A. Stellman and J. Greene, Applied Software Project Management, Cambridge, MA: O'Reilly Media, ISBN 0-596-00948-8.
[22] I. Sommerville, Software Engineering, 8th ed., Addison-Wesley, 2008, ISBN 0-321-31379-8.
[23] R. J. Honicky and E. L. Miller, "Replication under scalable hashing: a family of algorithms for scalable decentralized data distribution," 18th International Parallel and Distributed Processing Symposium (IPDPS), April 2004, doi:10.1109/IPDPS.2004.1303042.
[24] T. D. Chandra, R. Griesemer and J. Redstone, "Paxos made live: an engineering perspective," Proc. 26th Annual ACM Symposium on Principles of Distributed Computing (PODC '07), ACM, 2007, pp. 398-407, doi:10.1145/1281100.1281103.
[25] L. Lamport, "Paxos made simple," ACM SIGACT News, vol. 32, no. 4, Dec. 2001, pp. 34-58, doi:10.1145/568425.568433.
[26] M. O. Rabin, "Efficient dispersal of information for security, load balancing and fault tolerance," Journal of the ACM, vol. 36, no. 2, pp. 335-348, April 1989, doi:10.1145/62044.62050.
[27] H. Weatherspoon and J. Kubiatowicz, "Erasure Coding Vs. Replication: A Quantitative Comparison," Revised Papers from the First International Workshop on Peer-to-Peer Systems (IPTPS '01), 2002, pp. 328-338.
[28] M. Quezada-Naquid, R. Marcelín-Jiménez and M. Lopez-Guerrero, "Fault Tolerance and Load Balance Tradeoff in a Distributed Storage System," Computación y Sistemas, vol. 14, no. 2, pp. 151-163, October-December 2010.
[29] A. Varga, http://www.omnetpp.org, 2014.
[30] M. Mitzenmacher and E. Upfal, Probability and Computing: Randomized Algorithms and Probabilistic Analysis, Cambridge University Press, 2005.
[31] M. J. Fischer, N. A. Lynch and M. S. Paterson, "Impossibility of distributed consensus with one faulty process," J. ACM, vol. 32, no. 2, 1985, pp. 374-382, doi:10.1145/3149.214121.
[32] E. J. Whitehead Jr. and Y. Y. Goland, "WebDAV: a network protocol for remote collaborative authoring on the Web," Proc. Sixth European Conference on Computer Supported Cooperative Work (ECSCW '99), Norwell, pp. 291-310.
[33] P. Gambarotto and P. Aubry, "ESUP-Portail: a pure WebDAV-based Network Attached Storage," EUNIS 2004, Bled, Slovenia, July 2004.
[34] H. K. Huang, PACS and Imaging Informatics: Basic Principles and Applications, 2nd ed., Wiley-Blackwell, 2010.
[35] D. A. Clunie, "PixelMed publishing," http://www.pixelmed.com/, July 2013.
[36] O. S. Pianykh, Digital Imaging and Communications in Medicine (DICOM): A Practical Introduction and Survival Guide, Springer, 2011.
[37] J. L. Gonzalez, J. Carretero Perez, V. J. Sosa-Sosa, J. F. Rodriguez Cardoso and R. Marcelin-Jimenez, "An approach for constructing private storage services as a unified fault-tolerant system," J. Syst. Softw., vol. 86, pp. 1907-1922, July 2013, doi:10.1016/j.jss.2013.02.056.
[38] D. Bermbach, M. Klems, S. Tai and M. Menzel, "MetaStorage: A federated cloud storage system to manage consistency-latency tradeoffs," Proc. 2011 IEEE 4th International Conference on Cloud Computing (CLOUD '11), IEEE Computer Society, 2011, pp. 452-459, doi:10.1109/CLOUD.2011.62.
[39] R. Ranjan, R. Buyya and A. Harwood, "A model for cooperative federation of distributed clusters," Proc. High Performance Distributed Computing (HPDC-14), 2005, pp. 295-296, doi:10.1109/HPDC.2005.1520982.