Policy-Driven Data Management for Distributed Scientific Collaborations Using a Rule Engine
Sara Alspaugh, Ann Chervenak, Ewa Deelman
Data-intensive distributed science applications increasingly depend on efficient access to data sets up to
petabytes in size as well as to distributed high performance computational resources. Such applications include
high-energy physics, gravitational wave physics, climate modeling, computational biology and earthquake
science. Scientific collaborations require flexible and secure sharing of diverse and geographically distributed
storage and computing resources. Data sets generated on an experimental apparatus or on computational
resources need to be distributed widely to scientists in the collaboration, often according to policies set by the
institutions that make up the collaboration or Virtual Organization (VO) [1]. These policies may pertain to the
dissemination, security, and reliability of data sets [2]. For example, in high-energy physics, data sets generated
at CERN are distributed according to a tiered policy that places decreasing subsets of data at sites in each tier,
giving physicists access to data and maintaining multiple copies of all data sets [3]. In gravitational wave
physics, data are disseminated based on metadata queries performed by scientists at each site [4]. The iRODS
system allows the specification of complex policies for data replication, consistency, and metadata management
using a rule engine [5].
In this work, we integrate an open source rule engine called Drools [6] with existing services for grid
data management: GridFTP [7] and the Globus Replica Location Service (RLS) [8]. A rule engine encodes knowledge
as rules, which are applied to input facts to produce decisions. Our rule engine
takes as input information about the current state of data in the distributed environment from the Replica
Location Service. It then enforces policies for data replication and dissemination, using the GridFTP data
transfer service to create new data objects.
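The fact-and-rule pattern described above can be sketched as follows. This is an illustrative Python sketch only, with hypothetical names; Drools itself expresses rules in its own DRL language and runs on the JVM.

```python
# Minimal sketch of how a rule engine matches rules against facts
# (hypothetical names; not the Drools API).

def run_rules(facts, rules):
    """Fire every rule whose condition matches a fact; collect the actions."""
    actions = []
    for fact in facts:
        for condition, action in rules:
            if condition(fact):
                actions.append(action(fact))
    return actions

# Facts: current replica state, e.g. as reported by a replica catalog.
facts = [{"lfn": "file-001", "replicas": 2},
         {"lfn": "file-002", "replicas": 3}]

# One rule: any file with fewer than three replicas needs another copy.
rules = [(lambda f: f["replicas"] < 3,
          lambda f: ("replicate", f["lfn"]))]

print(run_rules(facts, rules))  # [('replicate', 'file-001')]
```

In our system, the facts come from RLS queries and the fired actions become GridFTP transfer requests.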
The goal of our work is to demonstrate that a rule engine is well suited to the problem of policy-based
data management for distributed science applications. Data management policies can be readily characterized as
rules that govern the maintenance and movement of data and which can be used to make decisions based on the
current state of data and resources within a VO. Moreover, data management policies are dependent upon the
requirements of a VO, which may change over time, making the flexibility of a rule engine-based approach
attractive in contrast to traditional algorithmic approaches.
In particular, our work demonstrates that the Drools open source rule engine can be easily and
effectively integrated with existing grid tools for distributed data management. We show results for two practical
data management policies: distributing data sets based on a tier-like dissemination model and maintaining a
specified number of replicas of each data item in the distributed system.
For our experiments, the Drools engine, a Replica
Location Service, and a GridFTP client ran on a single-core
i686 GNU/Linux machine with 1 GB of memory. We
distributed data on a cluster of eight nodes, each of which
was an i686 GNU/Linux machine with 2 GB of main memory.
This experimental setup is illustrated in Figure 1.
Figure 1 - Experimental Setup.

The first policy specifies a tier-based pattern for disseminating data products throughout a system upon initial
publication. In this instance, seven cluster nodes were categorized into three tiers of 1, 2, and 4 nodes,
respectively. We assumed that 800 files of 1 MByte each had been published at the top-tier node and registered in
the RLS prior to the start of the experiment. At the beginning of the experiment, our Policy-Driven Data
Placement Service queried the RLS to acquire the list of files to be distributed, which took 1.2 seconds; it then
encapsulated this information as facts and inserted them into the Drools rule engine, which took 1.6 seconds.
Next, the placement service initiated third-party GridFTP transfers to disseminate the data according to the
policy, which had been encoded as rules. It first copied 400 files to each of the nodes in the second tier and
then copied 200 files to each node in the third tier. Completely disseminating all of the files in this manner
took approximately 323 seconds.
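The arithmetic of the tier-based plan can be sketched as below. This is an illustrative helper under the stated experimental parameters (800 one-MByte files, tiers of 1, 2, and 4 nodes), not the DRL rules themselves; `dissemination_plan` is a hypothetical name.

```python
# Sketch of the tier-based dissemination plan used in the experiment:
# all files start at the single top-tier node; each lower tier splits
# the data set evenly across its nodes.

def dissemination_plan(total_files, tier_sizes):
    """Return files-per-node for each tier below the top tier."""
    return {tier: total_files // nodes
            for tier, nodes in enumerate(tier_sizes[1:], start=2)}

plan = dissemination_plan(800, [1, 2, 4])
print(plan)  # {2: 400, 3: 200}

# Total data transferred: 2 nodes x 400 MB + 4 nodes x 200 MB = 1600 MB,
# moved in roughly 323 s, i.e. about 5 MB/s aggregate.
transferred_mb = 2 * plan[2] + 4 * plan[3]
print(transferred_mb)  # 1600
```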
Figure 2 - Execution of Tier-Based Data Dissemination Policy
Figure 3 - Execution of Policy Enforcing Maintenance of Three Replicas
In our second experiment, we implemented a policy that maintains a specified number of replicas of each file.
Prior to the start of the experiment, there were two copies each of 800 files of 1 MByte on the eight cluster
nodes, or 200 files per node. The policy required that three copies of each file be stored, subject to
constraints specified in the rules. Our application first queried the RLS for information about each of these
files and compared this information with information about the storage nodes. Based on this comparison and the
restrictions specified in the rules, our application selected source and destination nodes and initiated file
transfers for each of the 800 files. Figure 3 shows the increase in the number of copies over time. It took 806
seconds for the placement service to create a third copy of each file.
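The source/destination selection step can be sketched as follows. The selection policy here (send to the least-loaded node not already holding the file) is a hypothetical stand-in for the constraints that our rules actually encode, and `plan_third_copy` is an invented name for illustration.

```python
# Illustrative sketch of the replica-maintenance decision: for each file
# below the target replica count, pick a source node that holds it and a
# destination node that does not (hypothetical least-loaded heuristic).

def plan_third_copy(replica_map, nodes, target=3):
    """Return (file, source, destination) transfers to reach the target count."""
    # Current load: number of replicas already stored on each node.
    load = {n: sum(n in holders for holders in replica_map.values())
            for n in nodes}
    transfers = []
    for lfn, holders in replica_map.items():
        while len(holders) < target:
            dest = min((n for n in nodes if n not in holders),
                       key=lambda n: load[n])
            transfers.append((lfn, holders[0], dest))
            holders = holders + [dest]
            load[dest] += 1
    return transfers

nodes = [f"node{i}" for i in range(1, 9)]
replica_map = {"file-001": ["node1", "node2"]}
print(plan_third_copy(replica_map, nodes))  # [('file-001', 'node1', 'node3')]
```

In the experiment, the resulting transfer list would be handed to GridFTP, one third-party transfer per file.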
We have successfully integrated the Drools rule engine with existing tools for distributed data management in
grid environments. We demonstrated two policies based on typical data management requirements of scientific
virtual organizations. Potential future work includes using rule engine benchmarks to better evaluate the
performance of Drools; integrating the rule engine with additional Grid services, particularly grid information
services; exploring the full set of rule engine features; implementing alternative approaches to policy-driven
data management and comparing their performance to our implementation; and implementing a greater variety of
policies in complex Grid environments.
[1] I. Foster, C. Kesselman, and S. Tuecke, "The Anatomy of the Grid: Enabling Scalable Virtual Organizations," International Journal of High Performance Computing Applications, vol. 15, pp. 200-222, 2001.
[2] A. Chervenak, E. Deelman, M. Livny, M. Su, R. Schuler, S. Bharathi, G. Mehta, K. Vahi, "Data Placement for Scientific Applications in Distributed Environments," presented at 8th IEEE/ACM Int'l Conference on Grid Computing (Grid 2007), Austin, Texas, 2007.
[3] J. Rehn, T. Barrass, D. Bonacorsi, J. Hernandez, I. Semeniouk, L. Tuura, and Y. Wu, "PhEDEx high-throughput data transfer management system," presented at Computing in High Energy and Nuclear Physics (CHEP) 2006, Mumbai, India, 2006.
[4] A. Chervenak, R. Schuler, C. Kesselman, S. Koranda, B. Moe, "Wide Area Data Replication for Scientific Collaborations," presented at 6th IEEE/ACM Int'l Workshop on Grid Computing (Grid2005), Seattle, WA, USA, 2005.
[5] M. Hedges, A. Hasan, and T. Blanke, "Management and preservation of research data with iRODS," Proceedings of the ACM First Workshop on CyberInfrastructure: Information Management in eScience, pp. 17-22, 2007.
[6] Drools project, "Drools," http://www.jboss.org/drools/, 2008.
[7] W. Allcock, J. Bresnahan, R. Kettimuthu, M. Link, C. Dumitrescu, I. Raicu, I. Foster, "The Globus Striped GridFTP Framework and Server," presented at IEEE Supercomputing (SC05) Conference, Seattle, WA, 2005.
[8] A. L. Chervenak, N. Palavalli, S. Bharathi, C. Kesselman, R. Schwartzkopf, "Performance and Scalability of a Replica Location Service," presented at Thirteenth IEEE Int'l Symposium on High Performance Distributed Computing (HPDC-13), Honolulu, HI, 2004.