Policy-Driven Data Management for Distributed Scientific Collaborations Using a Rule Engine

Sara Alspaugh, Ann Chervenak, Ewa Deelman

Data-intensive distributed science applications increasingly depend on efficient access to data sets up to petabytes in size, as well as to distributed high-performance computational resources. Such applications include high-energy physics, gravitational wave physics, climate modeling, computational biology, and earthquake science. Scientific collaborations require flexible and secure sharing of diverse and geographically distributed storage and computing resources. Data sets generated on an experimental apparatus or on computational resources need to be distributed widely to scientists in the collaboration, often according to policies set by the institutions that make up the collaboration or Virtual Organization (VO) [1]. These policies may pertain to the dissemination, security, and reliability of data sets [2]. For example, in high-energy physics, data sets generated at CERN are distributed according to a tiered policy that places decreasing subsets of the data at sites in each tier, giving physicists access to data and maintaining multiple copies of all data sets [3]. In gravitational wave physics, data are disseminated based on metadata queries performed by scientists at each site [4]. The iRODS system allows the specification of complex policies for data replication, consistency, and metadata management using a rule engine [5].

In this work, we integrate an open source rule engine called Drools [6] with existing services for grid data management: GridFTP [7] and the Globus Replica Location Service (RLS) [8]. A rule engine encodes knowledge as rules, which are then used to process input data and produce decisions based on those rules. Our rule engine takes as input information about the current state of data in the distributed environment from the Replica Location Service.
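To make the fact/rule pattern concrete, the sketch below hand-rolls a minimal match-and-fire loop in plain Java. The `ReplicaFact` and `Rule` types and the replication threshold are illustrative stand-ins, not the Drools API; Drools expresses equivalent rules declaratively in its own rule language and handles matching internally.

```java
import java.util.ArrayList;
import java.util.List;

public class MiniRuleEngine {
    // Hypothetical fact: the replication state of one logical file,
    // as it might be reported by a replica catalog such as the RLS.
    public record ReplicaFact(String filename, int copies) {}

    // A rule pairs a condition over facts with an action. A rule engine
    // such as Drools matches every fact against every rule and fires the
    // actions of the rules whose conditions hold.
    public interface Rule {
        boolean matches(ReplicaFact fact);
        String fire(ReplicaFact fact);
    }

    // Example rule: flag any file with fewer than minCopies replicas.
    public static Rule minCopiesRule(int minCopies) {
        return new Rule() {
            public boolean matches(ReplicaFact f) { return f.copies() < minCopies; }
            public String fire(ReplicaFact f) { return "replicate " + f.filename(); }
        };
    }

    // Naive match-and-fire loop standing in for the engine's inference cycle.
    public static List<String> run(List<ReplicaFact> facts, List<Rule> rules) {
        List<String> decisions = new ArrayList<>();
        for (ReplicaFact fact : facts)
            for (Rule rule : rules)
                if (rule.matches(fact))
                    decisions.add(rule.fire(fact));
        return decisions;
    }

    public static void main(String[] args) {
        List<ReplicaFact> facts = List.of(
            new ReplicaFact("a.dat", 2), new ReplicaFact("b.dat", 3));
        System.out.println(run(facts, List.of(minCopiesRule(3))));
        // prints: [replicate a.dat]
    }
}
```

The appeal of the declarative version is that the conditions and actions live outside the application code, so a VO can change policy by editing rules rather than reprogramming the placement service.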
The engine then enforces policies for data replication and dissemination, using the GridFTP data transfer service to create new data objects.

The goal of our work is to demonstrate that a rule engine is well suited to the problem of policy-based data management for distributed science applications. Data management policies can be readily characterized as rules that govern the maintenance and movement of data and that can be used to make decisions based on the current state of data and resources within a VO. Moreover, data management policies depend on the requirements of a VO, which may change over time, making the flexibility of a rule engine-based approach attractive in contrast to traditional algorithmic approaches. In particular, our work demonstrates that the Drools open source rule engine can be easily and effectively integrated with existing grid tools for distributed data management. We show results for two practical data management policies: distributing data sets based on a tier-like dissemination model, and maintaining a specified number of replicas of each data item in the distributed system.

For our experiments, the Drools engine, a Replica Location Service, and a GridFTP client ran on a single-core i686 GNU/Linux machine with 1 GB of memory. We distributed data on a cluster of eight nodes, each of which was an i686 GNU/Linux machine with 2 GB of main memory. This experimental setup is illustrated in Figure 1.

[Figure 1: Experimental setup.]

The first policy specifies a tier-based pattern for disseminating data products throughout a system upon initial publication. In this instance, seven cluster nodes were categorized into three tiers of 1, 2, and 4 nodes, respectively. We assumed that 800 files of size 1 MB had been published at the top-tier node and registered in the RLS prior to the start of the experiment. At the beginning of the experiment, our Policy-Driven Data Placement Service queries the RLS to acquire the list of files to be distributed, which takes 1.2 seconds; it then encapsulates this information as facts and inserts them into the Drools rule engine, which takes 1.6 seconds. Next, the placement service initiates third-party GridFTP transfers to disseminate the data according to the policy, which has been encoded as rules. It first copies 400 files to each of the nodes in the second tier and then copies 200 files to each node in the third tier. Completely disseminating all of the files in this manner takes approximately 323 seconds.

[Figure 2: Execution of tier-based data dissemination policy.]

In our second experiment, we implemented a policy that maintains a specified number of replicas of each file. Prior to the start of the experiment, there were two copies each of 800 files on the 8 cluster nodes, or 200 files per node. Each file was 1 MB in size. The policy required that three copies be stored for each file, subject to constraints specified in the rules. Our application first queried the RLS for information about each of these files and compared this with information about the storage nodes. Based on this comparison and the restrictions specified in the rules, our application selected source and destination nodes and initiated file transfers for each of the 800 files. Figure 3 shows the increase in the number of copies over time. It took 806 seconds for the placement service to create a third copy of each file.

[Figure 3: Execution of policy enforcing maintenance of three replicas.]

We have successfully integrated the Drools rule engine with existing tools for distributed data management in grid environments. We demonstrated two policies based on typical requirements for managing data according to the needs of scientific virtual organizations.
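The source/destination selection at the heart of the replica-count policy can be sketched roughly as follows, assuming a hypothetical replica map keyed by file name; the names and types here are illustrative, and the actual service obtains this state from the RLS and hands each chosen source/destination pair to a third-party GridFTP transfer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ReplicaPolicy {
    // A planned third-party transfer: copy `file` from `source` to `dest`.
    public record Transfer(String file, String source, String dest) {}

    // For each file with fewer than `target` copies, choose a source node
    // that holds the file and destination nodes that do not yet hold it.
    public static List<Transfer> plan(Map<String, Set<String>> replicas,
                                      List<String> nodes, int target) {
        List<Transfer> transfers = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : replicas.entrySet()) {
            Set<String> holders = entry.getValue();
            int missing = target - holders.size();
            if (missing <= 0) continue;                 // policy already satisfied
            String source = holders.iterator().next();  // any current holder
            for (String node : nodes) {
                if (missing == 0) break;
                if (!holders.contains(node)) {
                    transfers.add(new Transfer(entry.getKey(), source, node));
                    missing--;
                }
            }
        }
        return transfers;
    }

    public static void main(String[] args) {
        // One file with two copies; the policy target is three.
        List<Transfer> plan = plan(Map.of("f1", Set.of("n1", "n2")),
                                   List.of("n1", "n2", "n3", "n4"), 3);
        System.out.println(plan);
    }
}
```

A production placement service would refine the naive choices above, for example picking sources by load or proximity, but the rule-driven structure stays the same: compare catalog state against the policy target and emit the transfers needed to close the gap.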
Potential future work includes using rule engine benchmarks to better evaluate the performance of Drools; integrating the rule engine with additional Grid services, particularly grid information services; exploring the full set of rule engine features; implementing alternative approaches to policy-driven data management and comparing their performance to our implementation; and implementing a greater variety of policies in complex Grid environments.

References

[1] I. Foster, C. Kesselman, and S. Tuecke, "The Anatomy of the Grid: Enabling Scalable Virtual Organizations," International Journal of High Performance Computing Applications, vol. 15, pp. 200-222, 2001.
[2] A. Chervenak, E. Deelman, M. Livny, M. Su, R. Schuler, S. Bharathi, G. Mehta, and K. Vahi, "Data Placement for Scientific Applications in Distributed Environments," presented at the 8th IEEE/ACM Int'l Conference on Grid Computing (Grid 2007), Austin, Texas, 2007.
[3] J. Rehn, T. Barrass, D. Bonacorsi, J. Hernandez, I. Semeniouk, L. Tuura, and Y. Wu, "PhEDEx High-Throughput Data Transfer Management System," presented at Computing in High Energy and Nuclear Physics (CHEP) 2006, Mumbai, India, 2006.
[4] A. Chervenak, R. Schuler, C. Kesselman, S. Koranda, and B. Moe, "Wide Area Data Replication for Scientific Collaborations," presented at the 6th IEEE/ACM Int'l Workshop on Grid Computing (Grid 2005), Seattle, WA, USA, 2005.
[5] M. Hedges, A. Hasan, and T. Blanke, "Management and Preservation of Research Data with iRODS," in Proceedings of the First ACM Workshop on CyberInfrastructure: Information Management in eScience, pp. 17-22, 2007.
[6] Drools project, http://www.jboss.org/drools/, 2008.
[7] W. Allcock, J. Bresnahan, R. Kettimuthu, M. Link, C. Dumitrescu, I. Raicu, and I. Foster, "The Globus Striped GridFTP Framework and Server," presented at the IEEE Supercomputing (SC05) Conference, Seattle, WA, 2005.
[8] A. L. Chervenak, N. Palavalli, S. Bharathi, C. Kesselman, and R. Schwartzkopf, "Performance and Scalability of a Replica Location Service," presented at the Thirteenth IEEE Int'l Symposium on High Performance Distributed Computing (HPDC-13), Honolulu, HI, 2004.