2014 IEEE International Congress on Big Data

The Babel File System

Moisés Quezada-Naquid 1, Ricardo Marcelín-Jiménez 1,3, José Luis González-Compeán 2
1 Department of Electrical Eng., UAM-Iztapalapa, México D.F., MEXICO ([email protected], [email protected])
2 Dept. of Systems and Computation, Cd. Valles Institute of Technology, Cd. Valles, San Luis Potosí, MEXICO ([email protected])
3 Collaborates with the research team at the Centro Público de Información y Documentación para la Industria (INFOTEC)

Abstract— The Babel File System is a dependable, scalable and flexible storage system. Among its main features we underline the availability of different types of data redundancy, a careful decoupling between data and metadata, a middleware that enforces metadata consistency, and its own load-balance and allocation procedure, which adapts to the number and capacities of the supporting storage devices. It can be deployed over different hardware platforms, including commodity hardware. Our proposal has been designed to allow developers to settle a trade-off between price and performance, depending on their particular applications.

Keywords: BFS, massive cluster-storage, dependability, scalability, flexibility, parallelism, commodity hardware.

I. INTRODUCTION

The virtualization of fault-tolerant storage systems has become a common solution that provides both data availability and reliability in many applications. This is the case of web storage services such as GFS [1], Amazon S3 [2], Nirvanix CloudNAS [3], or Microsoft SkyDrive [4], among others, which are successful businesses offering permanent file availability. Nevertheless, it has been observed that the cost of external web storage is several times higher than the value of the involved storage components [5, 6].

As information becomes a very important asset, organizations have started considering the possibility of building their own large-scale storage capacities as a promising alternative to keep control of this valuable resource. This decision is fostering studies that aim to develop new storage systems based on a rather flexible set of components, as well as guidelines for configuring these environments in a cost-effective manner [7, 8].

Over the last years we have been working on the construction of a storage solution called the Babel File System (after “The Library of Babel”, a short story by Jorge Luis Borges included in “The Garden of Forking Paths” [9]), which supports features that can be found in similar systems such as those mentioned above. A storage cell, as we call an instance of Babel, is made up of a set of devices connected by means of a local area network. Users perceive only a single virtual device which supports file upload and download operations. Behind this interface there is a redundant set of servers, also called proxies, coordinating a vast number of storage devices, whose joint capacities can scale up to the order of petabytes. We have built a dependable system that is oriented to support many different applications. It is important to notice that our proposal is able to integrate different operating systems and storage technologies into a unified framework. Users are not, and do not need to be, aware of the underlying technologies that support their services. This approach is known as “object-based storage” [10, 11].

In this paper we describe the design process that we followed and the lessons that we learned during the development of the Babel File System. Our aim is to offer a competitive and cost-effective alternative for in-house large-scale storage. To date, we have deployed two services based on our platform: a WebDAV server and a biomedical image storage service (PACS: Picture Archiving and Communication System).

The rest of this paper is organized as follows: in section II, we present some of the most important storage systems that have influenced our work and review the lessons we learned from these leading designs. In section III, we describe the functional and non-functional requirements that guided our work. In section IV we introduce the most important operational entities that came out of the analysis stage. In section V, we shortly introduce two applications that we developed based on the storage capacities supported by Babel. Finally, in section VI, we summarize our design decisions and present the ongoing work that Babel has triggered.
II. RELATED WORK

In this section we present the systems that have inspired our work. It is important to point out that this is by no means an exhaustive survey; rather, we consider that the following is a very representative list, including the most important milestones along this trend. Generally speaking, the systems that we review advocate the construction of a virtual storage device based on a set of storage components where redundant information is stored. Under this approach, the failure of an individual component does not immediately cancel the availability of a given file, since the remaining components have sufficient information to rebuild the file whenever it is required. Storage devices are “hidden” from the final user or, at least, they are not the first component to be contacted.

• CEPH is a distributed storage system initially developed at the University of California, Santa Cruz [12]. It is oriented to support massive volumes of scientific data. The CEPH design considers two principles. First, there is no entry table to find the place where a file has been allocated; instead, this place is calculated by means of a pseudo-random function. Second, when a user contacts the system in order to retrieve a previously stored file, she can be assigned to any available server. This second principle means that servers do not control a fixed set of storage nodes.

• GFS was designed by Google Inc. in order to support the massive storage of information generated or collected by Google itself [1]. Among the principles that we highlight from this system, servers are in charge of control and monitoring tasks in order to detect failures, trigger repair procedures and tune system performance. Balance is enforced using a chunk size, which defines the maximum length of an information unit to be allocated. If a file exceeds this parameter, it is split up into as many chunks as necessary to guarantee that each resulting chunk is smaller than or equal to this maximum size.

• HDFS is a distributed, scalable, and portable file system written in Java for the Hadoop framework [13]. Each node in a Hadoop instance typically has a single data node, and a cluster of data nodes forms the HDFS cluster. Each data node serves up blocks of data over the network using a block protocol specific to HDFS. The file system uses the TCP/IP layer for communication; clients use RPC to communicate with each other. HDFS stores large files (an ideal file size is a multiple of 64 MB) across multiple machines. It achieves reliability by replicating the data across multiple hosts, and hence does not require RAID storage on hosts. With the default replication value, 3, data is stored on three nodes. Data nodes can talk to each other to rebalance data, to move copies around, and to keep the replication above a given service level agreement. HDFS requires one unique server, the name node, which is a single point of failure for an HDFS installation.

• The Lustre file system architecture was developed as a research project at Carnegie Mellon University [14]. A Lustre file system has the following major functional units: a single metadata server with a single metadata target (MDT) per Lustre file system, where namespace metadata is kept; one or more object storage servers (OSS) that store file data on one or more object storage targets (OST); and clients.
• GlusterFS is based on a stackable user-space design without compromising performance [15]. It has found a variety of applications, including cloud computing, biomedical sciences and archival storage. GlusterFS has a client and a server component. Servers are typically deployed as storage bricks, with each server running a glusterfsd daemon to export a local file system as a volume. The GlusterFS client process composes virtual volumes from multiple remote servers using stackable translators. By default, a file is stored as a single information unit, but striping of files across multiple remote volumes is also supported. The GlusterFS server is kept minimally simple: it exports an existing file system as-is, leaving it up to client-side translators to structure the store. The clients themselves are stateless, do not communicate with each other, and are expected to have translator configurations consistent with one another. GlusterFS relies on an elastic hashing algorithm, rather than using either a centralized or a distributed metadata model. With version 3.1 and later of GlusterFS, volumes can be added, deleted, or migrated dynamically, helping to avoid coherency problems, and allowing GlusterFS to scale up on commodity hardware by avoiding bottlenecks that normally affect more tightly-coupled distributed file systems.

Among the lessons that we have learned [16, 17, 18, 19], we underline the following principles:

1) For systems built from a large number of parts, failures must be regarded as a common issue. Under this assumption, the design process must consider different types of redundant resources that bail the system out of transient failures. Systems must also include monitoring or supervision capabilities, as well as self-repair functions.

2) Massive storage systems have an interface that offers a unified entry point. Behind this interface, there is a set of servers that filter each incoming request and manage the individual storage devices, where redundant information is finally allocated.

3) Most of the individual storage devices also include processing capabilities. Therefore, the processing functions that are part of the storage service can be entrusted to these very devices. With this decision, a high degree of parallelism can be achieved, which has a potential benefit on service times and availability. This decision implies the definition of two key parameters: a maximum-length processing unit (also called chunk or fragment size), and a maximum-length storage unit (also called block size).
In turn, parallelism calls for a very resourceful design, including load-balance procedures for either processing or storage units.

4) Under very demanding conditions, designers must consider the possibility of using high-speed temporary devices that work as buffers between the application clients and the final storage device.

5) In order to address reliability concerns, it appears convenient to decouple data and metadata. Data refers to the storage units that encode the users' information, while metadata refers to the information the system is required to keep in order to retrieve data. The former is allocated on the final storage devices, while the latter is kept on the system servers. Notice that a redundant set of servers implies a potential inconsistency among the replicas of the information that each server has recorded. Also notice that, as the load-balance procedure is invoked, metadata may become out of date.

III. SYSTEM OVERVIEW AND DESIGN CONSIDERATIONS

In this section we present the functional and non-functional requirements that guided our design [20, 21, 22]. Functional requirements are explained in terms of their corresponding use cases. As for non-functional requirements, we present a list of the quality issues that have been taken into account. At the same time, we introduce, on the fly, the key processing units that translate these use cases into concrete operations, and we also describe the way these units impact the system's quality and performance. It is important to notice that, in a highly dependable system such as Babel, functional and non-functional requirements are tightly intertwined.

Use cases can be divided into two families: the first one supports users' account management and includes account creation, cancellation, connection and disconnection. The second one involves resource management and includes file upload, download, delete, disk replacement, scaling and balance. For the sake of brevity we will now focus on the upload case, since we consider that it involves most of the parameters and processing units that take part in each of the remaining cases.

Figure 1. Babel upload sequence.

Let us suppose that a given user, say Mary, wants to upload a file. As we describe the steps required to fulfill her request, we introduce the components of the system (a code sketch of the fragmentation and naming steps follows the list).

1. Mary contacts the cell interface.

2. An appointed server, or proxy, validates her as an authorized user.

3. As Mary submits her file, the proxy creates a stream between her computer and a given logical volume, also called a storage node. This node is selected by the proxy according to a random selection procedure. The proxy also records this operation in the current log and creates a new entry in its database (metadata), in order to support the future retrieval.

4. Now, the storage node starts receiving the file stream. To foster processing balance, the system defines a parameter called the maximum storage unit (MSU). Files are split up into as many processing units, or fragments, as necessary to guarantee that the length of each fragment does not exceed the given MSU.

5. Each storage node maintains a revolving or cyclic list, called its ring, with the identities of the storage nodes that collaborate with it. According to the order in which they appear in the ring, the node in charge allocates each of the resulting fragments to its collaborators, starting from the last appointed place.

6. To achieve fault tolerance and high availability, fragments undergo a redundancy generation procedure. The system supports two different procedures: either simple replication or the information dispersal algorithm (IDA) [23]. Depending on a service level agreement, which is related to Mary's profile, the node that has received a fragment selects the corresponding procedure. Replication creates two identical copies of the fragment, called blocks. Instead, IDA creates a set of k different storage units, also called blocks, such that the original fragment can be recovered provided that any m blocks are available.

7. Whether the node uses replication or IDA, the resulting blocks are accommodated by invoking a local function called the oracle. This is a deterministic function that calculates the id of the storage node where the block will be finally allocated. Each block has a fixed set of properties that we refer to as its signature, and an oracle maps each block signature into a storage node id. Notice that it is possible to retrieve a block provided that its corresponding signature is known and an instance of the oracle is available, regardless of the place where the oracle is invoked.

8. A node which is required to process or store an information unit (file, fragment, or block) confirms the completion of this task to the immediate source that requested its capacity. A proxy keeps a replica of its metadata on each of the proxies that make up the cell interface. In a similar way, a storage node replicates its metadata on a couple of nodes that are part of its ring. In any case, metadata consistency is an important aspect to care about.
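To make steps 4 to 7 more concrete, the following sketch, which is ours and not the authors' code, shows how an incoming byte stream could be cut into MSU-bounded fragments, named following the signature convention used in the “report.txt” example of section IV.B, and handed around a node's ring. The MSU value, node names and function names are illustrative assumptions.

```python
# Illustrative sketch of steps 4-7: MSU-bounded fragmentation, ring
# dispatch, and fragment naming (assumed convention "USR<user>:<file>.fN").
MSU = 4 * 1024 * 1024  # maximum storage unit, assumed 4 MiB for illustration

def split_into_fragments(data: bytes, msu: int = MSU) -> list[bytes]:
    """Split a file stream into fragments no longer than the MSU."""
    return [data[i:i + msu] for i in range(0, len(data), msu)]

def fragment_signatures(user: str, filename: str, n_fragments: int) -> list[str]:
    """Build the per-fragment signatures (naming convention assumed)."""
    return [f"USR{user}:{filename}.f{i + 1}" for i in range(n_fragments)]

def dispatch_over_ring(fragments: list[bytes], ring: list[str], start: int = 0):
    """Assign each fragment to the next collaborator in the cyclic ring."""
    assignment = {}
    for j, fragment in enumerate(fragments):
        node = ring[(start + j) % len(ring)]
        assignment[f"fragment-{j + 1}"] = (node, len(fragment))
    return assignment

if __name__ == "__main__":
    data = b"x" * (10 * 1024 * 1024)          # a 10 MiB upload
    fragments = split_into_fragments(data)     # 3 fragments under a 4 MiB MSU
    sigs = fragment_signatures("Mary", "report.txt", len(fragments))
    ring = ["node-a", "node-b", "node-c", "node-d"]
    print(sigs)
    print(dispatch_over_ring(fragments, ring))
```

Each fragment would then be handed to the redundancy procedure of step 6 and its blocks placed through the oracle of step 7, which is sketched later in section IV.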
Three basic considerations must be taken into account to build a dependable storage system: availability, fault tolerance and scalability. In order to fulfill these goals, a thorough design should consider information redundancy techniques and data placement strategies. The former refers to the methods that produce redundant information to support availability and fault tolerance. The latter refers to the (re)allocation of data within the available storage devices, to support changes to the initial configuration.

Redundant information, as we already presented, can be obtained from simple replication that generates k copies, called storage units or blocks, of each original fragment (or processing unit). This is the principle followed by GFS [1] and HDFS [13]. Alternatively, redundancy can also be produced using error-correcting techniques, for instance the information dispersal algorithm (IDA), which is a linear transformation over a finite field. In turn, a data placement strategy, or oracle, determines the storage device where a given block has to be allocated [19, 23]. This function should be supported efficiently, considering the dynamics of the overall system, including new devices that come into operation, or old devices that experience temporary or permanent failures and are eventually replaced.

The non-functional requirements that we have considered are: cost, reliability/fault tolerance, scalability, availability, modularity, portability and interface.

• We sought a cost-effective solution that can be assembled even from commodity components.

• As we mentioned before, the key to reliability/fault tolerance is the utilization of redundant resources, i.e. redundant devices, redundant information and redundant processing capacities.

• Scalability has been accomplished by designing an adaptive oracle that enforces storage balance and adapts its work to a growing number of storage devices.

• Availability, as we understand it, is achieved using parallelism, either to process information or to store and retrieve redundant blocks.

• Modularity is achieved by articulating a loosely-coupled set of modules (and plug-ins) that can be modified independently from each other.

• Portability refers to the possibility of deploying the system over different hardware platforms. We understood the necessity of a multi-platform programming language and we chose Python as the fundamental building tool.

• Finally, we devised a small and simple command-line interface that can be easily extended to match different client applications, supported by a GUI or a Web interface.

IV. THE KEYS TO BABEL

Besides the software design process itself and the resulting blueprints, we consider that the major contributions of our project are the following operational units: i) a parameterized IDA module that can be easily adapted to different conditions, ii) our own oracle design and its corresponding module implementation, and iii) an implementation of the Paxos protocol [24, 25], enforcing metadata consistency.

A. The information dispersal algorithm

The IDA [26, 27] achieves fault tolerance by means of information redundancy. Let F be a fragment. F is transformed into k files called dispersals, each of size |F|/m, where k > m > 1. Then, the dispersals are handed over to a set of disks, i.e. each disk stores one of the k dispersals of F. From the properties of the algorithm it is guaranteed that if any k - m dispersals are lost, the original information can be reconstructed from the m surviving dispersals. Due to these properties, not only can a fragment F be recovered from any m of its blocks, but also any missing block can be recovered provided that m blocks remain available. In this case, the cost of the reconstruction can be accepted, due to the increased usable storage capacity and superior fault tolerance.
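The following is a toy reconstruction of Rabin's dispersal scheme [26], written by us for illustration and not the parameterized Babel module, over the prime field GF(257) with (k, m) = (5, 3): each dispersal holds roughly |F|/3 symbols and any 3 of the 5 dispersals recover the fragment.

```python
# Toy IDA over GF(257): encode a fragment into k dispersals of |F|/m symbols
# each; any m of them rebuild the fragment by inverting a Vandermonde system.
P = 257  # prime field; every byte value 0..255 fits

def _vandermonde_row(i: int, m: int) -> list[int]:
    return [pow(i + 1, j, P) for j in range(m)]

def encode(fragment: bytes, k: int = 5, m: int = 3):
    """Return k dispersals, each roughly len(fragment)/m symbols long."""
    pad = (-len(fragment)) % m
    data = list(fragment) + [0] * pad
    columns = [data[c:c + m] for c in range(0, len(data), m)]
    dispersals = []
    for i in range(k):
        row = _vandermonde_row(i, m)
        dispersals.append([sum(r * x for r, x in zip(row, col)) % P
                           for col in columns])
    return dispersals, len(fragment)

def _solve(matrix, rhs):
    """Solve matrix * x = rhs over GF(P) by Gaussian elimination."""
    n = len(matrix)
    a = [row[:] + [v] for row, v in zip(matrix, rhs)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if a[r][col])
        a[col], a[pivot] = a[pivot], a[col]
        inv = pow(a[col][col], P - 2, P)
        a[col] = [(x * inv) % P for x in a[col]]
        for r in range(n):
            if r != col and a[r][col]:
                factor = a[r][col]
                a[r] = [(x - factor * y) % P for x, y in zip(a[r], a[col])]
    return [a[r][n] for r in range(n)]

def decode(available, original_len: int, m: int = 3) -> bytes:
    """Rebuild the fragment from any m (index, dispersal) pairs."""
    indices, rows = zip(*available[:m])
    matrix = [_vandermonde_row(i, m) for i in indices]
    out = []
    for c in range(len(rows[0])):
        out.extend(_solve(matrix, [row[c] for row in rows]))
    return bytes(out[:original_len])

if __name__ == "__main__":
    dispersals, n = encode(b"Babel stores this fragment with IDA")
    survivors = [(0, dispersals[0]), (3, dispersals[3]), (4, dispersals[4])]
    assert decode(survivors, n) == b"Babel stores this fragment with IDA"
```

In this sketch each dispersal holds about one third of the fragment's symbols, so the stored excess is (k-m)/m, roughly 67% of |F|, compared with the 200% excess of keeping two extra full copies under plain replication.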
A very important matter that we addressed is the quest for what we could call a “good” combination of the IDA parameters (k, m). Let us recall that each dispersal is 1/m the size of F and, therefore, there is an excess of information, or information redundancy, equal to (k-m)/m times the size of F. In our implementation we consider that the combination (5,3) is a good trade-off between fault tolerance and redundancy.

The system performs two different types of operations: i) user operations (i.e., file storage, retrieval and erasure), which only involve active components, and ii) system operations, including account management and the recovery procedure. The latter is triggered when an active component crashes and a spare component is selected to replace the faulty one. When the recovery procedure starts, a spare component is appointed in order to replace the one that failed. As part of this process, all pertinent data from other components has to be retrieved in order to fully reconstruct the information previously available on the missing device.

We developed a performance study to evaluate the mean time to failure (MTTF) of our system and the impact of different factors on this metric [28]. For this goal we built a discrete-event simulator using OMNeT++ [29]. We considered the following assumptions: i) the system is working below its maximum capacity, ii) the time to recover a dispersal is linearly dependent on the size of the missing dispersal, iii) any user operation is interrupted during recovery, iv) the time to failure of all storage components is modeled by independent and identically distributed negative exponential random variables, and v) repair times are also represented by independent and identically distributed negative exponential random variables. For k = 5, our experiment design included 4 parameters: a) the repair time, b) the MTTF of individual storage components, c) the initial number of spare components and d) the initial number of active components.

Figure 2 shows the longest MTTF obtained. In solid line we see the corresponding experimental histogram, which we compare to a fitted negative exponential pdf, in dotted line. Both have a mean time equal to 4,981.60 years. This case corresponds to the following combination of parameters: mean repair time equal to 5 hrs, mean individual failure time equal to 20,000 hrs, spare components s = 3 and active components v = 5. In contrast, figure 3 shows the shortest MTTF obtained. Again, we compare the experimental distribution, in solid, to the fitted pdf, in dotted. This time, both have a mean time equal to 8.85 years. The corresponding parameters are: mean repair time equal to 20 hrs, mean individual failure time equal to 5,000 hrs, spare components s = 1, and active components v = 7.

Fig. 2. Longest MTTF of the System
Fig. 3. Shortest MTTF of the System

B. The oracle

An oracle is required to answer a very simple question: in which of the many devices that make up the system should a given block be allocated? Notice that this question is issued at different moments during a block's lifetime: i) when it is uploaded, and ii) every time it is retrieved. Nevertheless, as we mentioned before, the system experiences changes to its initial composition, and it is quite possible that the device where a given block was initially allocated is not the permanent place from which it is going to be retrieved. Indeed, there is a third condition that invokes the oracle: iii) when a new storage device has been incorporated, either to replace a faulty one or to scale up the overall system capacity. In this last circumstance, the oracle helps the system migrate blocks in order to recover its load balance.

It is quite important to notice that, although a given block may change its allocation during its lifetime, there exists a set of attributes, which we call the signature of the block, that identify this unit and never change, such as its name or its date of creation. Therefore, as we have already stated, the oracle receives the signature of a block in order to calculate its current location. The oracle adapts its function to the actual number of storage devices, but the metadata required for storage and retrieval remains unchanged. This principle eases the burden of metadata management.

Let us suppose that Mary wants to store a file “report.txt”. Considering its size and the definition of the parameter MSU, the file is split into 2 fragments: “USRMary:report.txt.f1” and “USRMary:report.txt.f2”. Now let us assume that each fragment is processed using an instance of IDA with parameters (5,3). If we want to know the position, i.e. the device where the 3rd block (dispersal) of the 2nd fragment should be allocated, we submit its signature “USRMary:report.txt.f2.b3” to the oracle, which is in charge of returning the identity of the final device.
It is a common practice to study the object allocation problem using a “bins and balls” model, borrowed from probability theory [30]. Storage devices are regarded as bins, and redundant data units, or objects, as balls. We call a redundancy group the set of balls (blocks) having a common source (fragment). A collision happens when two or more balls belonging to the same redundancy group are allocated to the same bin.

We recognize two sources that have inspired the design of our oracle: RUSH [23] and RS [19]. In RUSH (Replication Under Scalable Hashing), bins are grouped to form sub-clusters. Each time the system needs to scale up its overall capacity, a new sub-cluster is attached. Therefore, to find the bin where a given ball is accommodated, RUSH proceeds in two steps: first, it identifies the sub-cluster where the ball is placed and, second, it appoints a bin within the given sub-cluster. Balls are mapped to bins using prime-number arithmetic that guarantees that no two of them are allocated to the same bin. In turn, RS (Random Slicing) uses a hash function to map each object to a point in the [0.0, 1.0) interval. At the same time, the working interval is partitioned into smaller, non-overlapping intervals assigned to the bins currently in use, according to their relative capacities. Each time a new bin is incorporated, the interval is remapped over the extended set of bins. RUSH and RS excel at providing balance and time efficiency. The problem with RUSH is that it is necessary to keep track of the prime number and the sub-cluster size supporting each allocation. This fact may have a major impact on metadata management when dealing with massive-scale storage capacities. In contrast, the downside of RS is that it only provides an upper bound on the probability of collision, which is inversely proportional to the number of available bins.

Our solution is based on a combination of the principles of RUSH and RS. Initially, the overall storage capacity is evenly divided into v subsets of bins, P0, …, Pv-1, also called pools. Each pool has an initial capacity b0 and is managed as an individual instance of RS. If we assume a cyclic or revolving order on the pool IDs, we can say that pool P0 is the successor of pool Pv-1. Each time the system is about to reach its overall capacity, a new generation of v bins has to be attached, under two conditions: i) all the bins of the same generation have the same capacity and ii) each pool receives a new bin. The procedure to accommodate the relative capacities of the bins belonging to the same pool is exactly the same as in RS. This means that the intervals preserve their lengths despite the increase in the associated pool's storage capacity. Notice that, due to the initial settlement, properties i) and ii) produce v identical pools, regardless of the number of generations of bins that have been attached to scale up the system.

Let us assume that a redundancy group of r ≤ v objects, O0, …, Or-1, has to be accommodated on a system of v pools. We map object O0 to a given pool Pi (using a pseudo-random function). It is known that, for any number p which is relatively prime to v, we can build a permutation of the set {0, …, v-1}, starting from i, to appoint the successive pools where the remaining objects should be allocated. In other words, object Oj is charged to pool P((i + j·p) mod v), for j = 0, …, r-1. This simple mechanism rules out collisions within a redundancy group. Also, notice that if p = 1, we allocate the group on r successive pools, starting from Pi. Finally, an appointed pool stores the ball it receives as it would proceed in RS, which means that the ball is mapped to a number in the working interval [0.0, 1.0) and then the bin behind the given number is found. Since we have v identical pools, this final calculation has to be performed only once, independently of the group size.
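The sketch below reflects our reading of the pooled placement rule just described; it is an illustration, not the Babel oracle module. Object O0 picks pool Pi from a hash of the group's signature, sibling Oj goes to pool (i + j·p) mod v, and a single Random-Slicing-style interval lookup, shared by the v identical pools, chooses the bin inside each pool. The pool count, the step p and the interval table are assumptions.

```python
# Illustrative RS-Pools placement: pool permutation for a redundancy group
# plus one interval lookup shared by the identical pools.
import hashlib
from bisect import bisect_right

V = 5      # number of pools (assumed)
STEP = 2   # p, chosen relatively prime to V

# Upper bounds of the sub-intervals of [0.0, 1.0) and the bin id, within a
# pool, owning each sub-interval (relative capacities are assumed).
INTERVALS = [(0.4, 0), (0.7, 1), (1.0, 2)]

def _point(signature: str) -> float:
    """Hash a signature to a point of the working interval [0.0, 1.0)."""
    digest = hashlib.sha1(signature.encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def _bin_within_pool(signature: str) -> int:
    bounds = [upper for upper, _ in INTERVALS]
    return INTERVALS[bisect_right(bounds, _point(signature))][1]

def place_group(group_signature: str, r: int) -> list[tuple[int, int]]:
    """Return one (pool id, bin id) pair per block of a redundancy group."""
    i = int(hashlib.sha1(group_signature.encode()).hexdigest(), 16) % V
    bin_id = _bin_within_pool(group_signature)   # computed once per group
    return [((i + j * STEP) % V, bin_id) for j in range(r)]

if __name__ == "__main__":
    # the five blocks of fragment f2 of Mary's file land on five distinct pools
    print(place_group("USRMary:report.txt.f2", r=5))
```

Because gcd(p, v) = 1, the r pool indices are all distinct, so no two blocks of the same group can collide in a bin; when a new generation of bins arrives, only the interval table changes, not the signatures.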
We developed a study to investigate the overall load migration whenever a new generation of bins has to be attached. We assume that RS-Pools starts with v = {5, 6, 7} pools, and that each pool has an initial bin with capacity b0 = 1 TiB. When the k-th generation of v new bins is introduced, each bin has a capacity bk = 1.5 bk-1. We also assume that each redundancy group is made up of R = {3, 4, 5} balls, each with a size equal to 1 MiB. To compare with RUSHp under similar circumstances, we consider that, each time the system is about to scale up, a sub-cluster with v new bins is attached. Therefore, the overall capacity at each new bin generation is exactly the same for either RS-Pools or RUSHp. The results presented in figures 4 and 5 show the overall load migration after a new bin generation is attached and the system recovers its balance. For either RS-Pools or RUSHp, the system settles down to its long-term level when the 6th bin generation is introduced. In the case of RS-Pools, it asymptotically reaches a limit of 0.33. Also, we observe that this behavior is independent of the number of pools and of the number of redundant balls, i.e. the redundancy group size. Meanwhile, RUSHp stabilizes above 0.73; it is very sensitive to the group size and, in the long term, slightly sensitive to the total number of active bins.

Fig. 4. RS-Pools reallocation rate.
Fig. 5. RUSHp reallocation rate.

C. Paxos and metadata consistency

What is the metadata that must be recorded at the proxy in order to support the recovery of a previously stored file? In the current state of the system, it is enough to keep the name of the user (Mary), the name of the given file (report.txt), its size, the definition of the parameter MSU (which is a single value applied in every operation), and the information redundancy technique that is applied to produce the final blocks (0: replication, 1: IDA), which is linked to the user's profile. Notice that it is not necessary to record either the number of blocks produced from the initial source file, or the devices where they were allocated: this information can be calculated. Hence, the signature of each block is built and submitted to an instance of the oracle. This principle is what we call a decoupling between data and metadata.
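As a small illustration of this decoupling (ours, not Babel's proxy code), the short metadata record above is enough to regenerate every block signature and, through the oracle, every block location. The record layout, the MSU value and the place_group() oracle from the previous sketch are assumptions.

```python
# Regenerate block signatures from the minimal per-file metadata record.
import math

MSU = 4 * 1024 * 1024   # single system-wide value, as stated above (assumed)
K, M = 5, 3             # IDA parameters assumed for redundancy profile 1

record = {"user": "Mary", "file": "report.txt", "size": 6 * 1024 * 1024,
          "redundancy": 1}   # 0: replication, 1: IDA

def block_signatures(rec: dict) -> list[str]:
    """Recompute the signatures of every block of a stored file."""
    fragments = math.ceil(rec["size"] / MSU)           # derived, not stored
    blocks_per_fragment = K if rec["redundancy"] == 1 else 2
    prefix = f"USR{rec['user']}:{rec['file']}"
    return [f"{prefix}.f{f + 1}.b{b + 1}"
            for f in range(fragments) for b in range(blocks_per_fragment)]

# Each signature can then be submitted to an oracle instance, e.g.
# place_group(f"USRMary:report.txt.f{f+1}", r=K), to resolve the current
# device, so neither block counts nor device ids ever need to be recorded.
print(block_signatures(record))
```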
Let us assume now that the proxy faces one of the following conditions: it receives a large number of requests within a very short window of time, or it suddenly interrupts its operations and is considered out of service. In order to tolerate either of these conditions, we should deploy a redundant set of proxies, but also a replicated and consistent metadata file on each of them. It is known that the Paxos protocol offers a very efficient procedure to build a consistent record of a replicated database. The approach followed by this so-called part-time parliament protocol consists in solving the consensus problem by means of an appointed leader and a quorum of active entities, supporting the persistence of the values already decided.

We have developed an implementation of Paxos, currently under validation. As stated in the original proposal, the appointed leader enforces the accomplishment of the safety conditions, while liveness is guaranteed provided that there is exactly one leader. For this purpose we have implemented a heart-beat mechanism that triggers a new election procedure when the current leader interrupts its pulses. The FLP theorem [31] shows the impossibility of consensus under an asynchronous communication model; the heart-beat mechanism is an alternative to overcome the limitations due to the lack of a universal clock.
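A minimal sketch of such a heart-beat rule follows; it is illustrative only, since the actual time-outs, messages and election procedure belong to the authors' implementation. Followers track the leader's last pulse and call for a new election once a time-out, assumed here to be a few pulse periods, expires.

```python
# Heart-beat based leader failure detection triggering a new election.
import time
import threading

PULSE_PERIOD = 1.0        # seconds between leader pulses (assumed)
ELECTION_TIMEOUT = 3.5    # silence interpreted as leader failure (assumed)

class HeartbeatMonitor:
    def __init__(self, start_election):
        self.last_pulse = time.monotonic()
        self.start_election = start_election   # callback into the Paxos layer
        self._lock = threading.Lock()

    def on_pulse(self):
        """Called whenever a heart-beat message from the leader arrives."""
        with self._lock:
            self.last_pulse = time.monotonic()

    def watch(self):
        """Run in a background thread on every replica."""
        while True:
            time.sleep(PULSE_PERIOD)
            with self._lock:
                silent_for = time.monotonic() - self.last_pulse
            if silent_for > ELECTION_TIMEOUT:
                self.start_election()   # propose a new leader via Paxos
                return

if __name__ == "__main__":
    monitor = HeartbeatMonitor(start_election=lambda: print("start election"))
    threading.Thread(target=monitor.watch, daemon=True).start()
    time.sleep(4.5)   # no pulses arrive, so the watcher triggers an election
```

Consistently with FLP, such a detector may be wrong under asynchrony (a slow leader looks like a crashed one); Paxos preserves safety in that case, and the time-out only serves liveness.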
The team that built GFS developed its own proprietary implementation of Paxos [24], called Chubby. They mention that there is a long road from the description that appears in the initial paper to the final programming of Chubby; this road passes through several stops, including modeling, validation, as well as performance testing.

V. THE POSSIBILITIES OF BABEL

In this section we briefly describe two different applications that we developed based on the storage capacities supported by Babel. Once in operation, Babel can be understood as a black box with a simple command-line interface that can be extended to fit the requirements of different services, such as the WebDAV server or the Picture Archiving and Communications System (PACS) that we are about to introduce.

The Web Distributed Authoring and Versioning protocol (WebDAV), defined by IETF RFCs 2518, 3253 and 4918, is an extension of the Hypertext Transfer Protocol (HTTP) that turns a web site into a repository with reading and writing capabilities, where authors find support for file management operations such as those available in an ordinary file system, i.e. file and directory creation, changes, erasure, protection against overwriting, etc. We built an initial cloud storage service that evolved into a WebDAV server on top of the Babel interface. It is quite interesting to mention that we tested this service with a couple of commercial WebDAV clients [32, 33], including BitKinex and AnyClient.

In turn, a Picture Archiving and Communications System (PACS) [34], defined by the NEMA DICOM standard, also known as the ISO 12052:2006 standard, is designed to articulate the many different devices involved in the production, display, storage, retrieval and printing of medical image files. We built a PACS storage server that supports a subset of the services described in the DICOM conformance specifications [36]. Our implementation is based on the PixelMed toolkit [35], a set of free, libre and open-source libraries implementing code for reading and creating data, network and file support, object database management, display of directories, images, reports and spectra, and object validation. The architecture that we propose allows high cohesion and low coupling, since it is simple to replace the communication with any database handler. On the other hand, the integration of HTTP ensures portable communication with the interface of Babel.
VI. CONCLUSIONS AND FURTHER WORK

In this paper we have briefly introduced the design considerations that led to the construction of the Babel File System. This is a large-scale, highly dependable storage system, based on a rather flexible set of components. We consider that our proposal can be understood as a Lego-like family of solutions that can be easily fitted to different applications. Accordingly, people in charge of deploying storage systems may consider the possibility of using Babel, as they have enough leeway to settle a trade-off between price and performance, depending on their particular priorities.

We have carefully addressed three basic conditions that, from our point of view, provide the foundations of an effective and long-lasting system: i) reliability, ii) scalability, and iii) service times. Redundancy is the key to reliability, which grants service continuity. In turn, service continuity means that the system should be interrupted as little as possible, not only during failures, but also when the system reaches its limit and new components must be attached to scale up the overall storage capacity. Each time the system grows, it enters a transitory condition where some of the objects (blocks) already stored have to be reallocated in order to recover load balance. A thoughtful design must consider two critical issues during this stage. First, the system should reassign as few blocks as possible. Second, metadata updating should be kept in mind. Finally, service times may also profit from the fact that storage devices have processing capacities. Therefore, as there exists an important number of such devices, parallelism (and load balance) is fundamental to achieve short service times.

To show the potential of Babel, we developed a couple of client applications that base their work on the storage capacities supported by our system. We built a storage service that provides the interface of a WebDAV server, based on IETF RFC 4918. On the other hand, we built a PACS (Picture Archiving and Communications System), according to the NEMA DICOM standard, also known as the ISO 12052:2006 standard. Among the lessons that we learned from these accomplishments, we realized that Babel has a flexible interface that can be easily extended to fit the requirements of high-volume storage-based services, such as authoring and versioning systems, or those oriented to manage large volumes of heavy image files.

We are currently working on several directions for immediate work. Among the strategic issues that we are about to address, we consider the deployment of a parallel query platform over the set of processing nodes that make up the storage cell [37]. A second issue is the possibility of building a federation of storage cells. As such a federation grows, it would be difficult to keep a centralized control. To deal with this challenge, we consider that P2P systems may provide us with interesting ideas that could be adapted to our needs. For instance, we are considering the possibility of building semantic capacities on top of a cell federation resembling a P2P network [38, 39].

REFERENCES

[1] S. Ghemawat, H. Gobioff, and S.-T. Leung, “The Google file system,” SIGOPS Oper. Syst. Rev., Oct. 2003, pp. 29-43, doi:10.1145/1165389.945450
[2] M. Palankar, A. Iamnitchi, M. Ripeanu and S. Garfinkel, “Amazon S3 for science grids: a viable solution?,” Proc. International Workshop on Data-Aware Distributed Computing (DADC '08), ACM, June 2008, pp. 55-64, doi:10.1145/1383519.1383526
[3] CloudNAS, http://en.wikipedia.org/wiki/Nirvanix, 2014.
[4] SkyDrive, http://skydrive.live.com, 2014.
[5] I. Ion, N. Sachdeva, P. Kumaraguru and S. Čapkun, “Home is safer than the cloud!: privacy concerns for consumer cloud storage,” Proc. Seventh Symposium on Usable Privacy and Security (SOUPS '11), Article 13, 20 pages, doi:10.1145/2078827.2078845
[6] E. Walker, W. Brisken and J. Romney, “To Lease or Not to Lease from Storage Clouds,” Computer, vol. 43, no. 4, pp. 44-50, April 2010, doi:10.1109/MC.2010.115
[7] J. L. Gonzalez and R. Marcelin-Jimenez, “Phoenix: A Fault-Tolerant Distributed Web Storage Based on URLs,” Proc. IEEE 9th International Symposium on Parallel and Distributed Processing with Applications (ISPA), pp. 282-287, May 2011, doi:10.1109/ISPA.2011.33
[8] E. Chai, M. Uehara, M. Murakami and M. Yamagiwa, “Online Web Storage Using Virtual Large-Scale Disks,” Proc. International Conference on Complex, Intelligent and Software Intensive Systems (CISIS '09), pp. 512-517, 16-19 March 2009, doi:10.1109/CISIS.2009.74
[9] J. L. Borges, El jardín de senderos que se bifurcan, Editorial Sur, 1941.
[10] R. O. Weber, “Information Technology – SCSI object-based storage device commands (OSD),” Technical Council Proposal Document T10/1355-D, Technical Committee T10.
[11] R. J. Honicky and E. Miller, “A Fast Algorithm for Online Placement and Reorganization of Replicated Data,” Proc. International Parallel and Distributed Processing Symposium, 2003, pp. 22-26, doi:10.1109/IPDPS.2003.1213151
[12] S. A. Weil, S. A. Brandt, E. L. Miller, D. D. E. Long and C. Maltzahn, “Ceph: a scalable, high-performance distributed file system,” Proc. 7th Symposium on Operating Systems Design and Implementation (OSDI '06), 2006, pp. 307-320.
[13] K. Shvachko, H. Kuang, S. Radia and R. Chansler, “The Hadoop Distributed File System,” Proc. IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), pp. 1-10, May 2010, doi:10.1109/MSST.2010.5496972
[14] P. Schwan, “Lustre: Building a file system for 1000-node clusters,” Linux Symposium, 2003.
[15] Gluster Community, http://www.gluster.org, 2014.
[16] J. Lee, B. Tierney and W. Johnston, “Data Intensive Distributed Computing: A Medical Application Example,” Proc. 7th International Conference on High-Performance Computing and Networking (HPCN Europe), 1999, pp. 150-158.
[17] B. L. Tierney, J. Lee, B. Crowley and M. Holding, “A Network-Aware Distributed Storage Cache for Data Intensive Environments,” Proc. High Performance Distributed Computing Conference (HPDC '99), 1999, pp. 185-193.
[18] Z. Ali and Q. Malluhi, “NSM: A Distributed Storage Architecture for Data-Intensive Applications,” Proc. 20th IEEE/11th NASA Goddard Conference on Mass Storage Systems and Technologies (MSS '03), 2003, p. 87.
[19] A. Miranda, S. Effert, Y. Kang, E. L. Miller, A. Brinkmann and T. Cortes, “Reliable and randomized data distribution strategies for large scale storage systems,” Proc. 18th International Conference on High Performance Computing (HiPC), pp. 18-21, Dec. 2011, doi:10.1109/HiPC.2011.6152745
[20] K. E. Wiegers, Software Requirements 2: Practical Techniques for Gathering and Managing Requirements Throughout the Product Development Cycle, 2nd ed., Redmond: Microsoft Press, 2003, ISBN 0-7356-1879-8.
[21] A. Stellman and J. Greene, Applied Software Project Management, Cambridge, MA: O'Reilly Media, ISBN 0-596-00948-8.
[22] I. Sommerville, Software Engineering, 8th ed., Addison-Wesley, 2008, ISBN 0-321-31379-8.
[23] R. J. Honicky and E. L. Miller, “Replication under scalable hashing: a family of algorithms for scalable decentralized data distribution,” Proc. 18th International Parallel and Distributed Processing Symposium, pp. 26-30, April 2004, doi:10.1109/IPDPS.2004.1303042
[24] T. D. Chandra, R. Griesemer and J. Redstone, “Paxos made live: an engineering perspective,” Proc. Twenty-Sixth Annual ACM Symposium on Principles of Distributed Computing (PODC '07), ACM, 2007, pp. 398-407, doi:10.1145/1281100.1281103
[25] L. Lamport, “Paxos made simple,” ACM SIGACT News (Distributed Computing Column), vol. 32, no. 4, Dec. 2001, pp. 34-58, doi:10.1145/568425.568433
[26] M. O. Rabin, “Efficient dispersal of information for security, load balancing and fault tolerance,” Journal of the ACM, vol. 36, no. 2, pp. 335-348, April 1989, doi:10.1145/62044.62050
[27] H. Weatherspoon and J. Kubiatowicz, “Erasure Coding vs. Replication: A Quantitative Comparison,” Revised Papers from the First International Workshop on Peer-to-Peer Systems (IPTPS '01), 2002, pp. 328-338.
[28] M. Quezada-Naquid, R. Marcelín-Jiménez and M. Lopez-Guerrero, “Fault Tolerance and Load Balance Tradeoff in a Distributed Storage System,” Computación y Sistemas, vol. 14, no. 2, pp. 151-163, October-December 2010.
[29] A. Varga, http://www.omnetpp.org, 2014.
[30] M. Mitzenmacher and E. Upfal, Probability and Computing: Randomized Algorithms and Probabilistic Analysis, Cambridge University Press, 2005.
[31] M. J. Fischer, N. A. Lynch and M. S. Paterson, “Impossibility of distributed consensus with one faulty process,” J. ACM, vol. 32, no. 2, 1985, pp. 374-382, doi:10.1145/3149.214121
[32] E. J. Whitehead Jr. and Y. Y. Goland, “WebDAV: a network protocol for remote collaborative authoring on the Web,” Proc. Sixth European Conference on Computer Supported Cooperative Work (ECSCW '99), Norwell, pp. 291-310.
[33] P. Gambarotto and P. Aubry, “ESUP-Portail: a pure WebDAV-based Network Attached Storage,” EUNIS 2004, Bled, Slovenia, July 2004.
[34] H. K. Huang, PACS and Imaging Informatics: Basic Principles and Applications, 2nd ed., Wiley-Blackwell, 2010.
[35] D. A. Clunie, “PixelMed publishing,” http://www.pixelmed.com/, July 2013.
[36] O. S. Pianykh, Digital Imaging and Communications in Medicine (DICOM): A Practical Introduction and Survival Guide, Springer, 2011.
[37] J. L. Gonzalez, J. Carretero Perez, V. J. Sosa-Sosa, J. F. Rodriguez Cardoso and R. Marcelin-Jimenez, “An approach for constructing private storage services as a unified fault-tolerant system,” J. Syst. Softw., vol. 86, pp. 1907-1922, July 2013, doi:10.1016/j.jss.2013.02.056
[38] D. Bermbach, M. Klems, S. Tai and M. Menzel, “MetaStorage: A federated cloud storage system to manage consistency-latency tradeoffs,” Proc. 2011 IEEE 4th International Conference on Cloud Computing (CLOUD '11), IEEE Computer Society, 2011, pp. 452-459, doi:10.1109/CLOUD.2011.62
[39] R. Ranjan, R. Buyya and A. Harwood, “A model for cooperative federation of distributed clusters,” Proc. High Performance Distributed Computing (HPDC-14), 2005, pp. 295-296, doi:10.1109/HPDC.2005.1520982