The Data Grid:
Towards an Architecture for the Distributed Management
and Analysis of Large Scientific Datasets
Caitlin Minteer & Kelly Clynes
The Data Grid




Large dataset size
Geographic distribution of users and
resources
Computationally intensive analysis
No integrating architecture yet exists that
lets us apply these technologies in a coordinated
way across large-scale application domains
Data grid applications must frequently
operate in wide area, multi-institutional
diverse environments
Design Architecture for
The Data Grid

Mechanism Neutrality


Designed to be as independent as
possible of low-level mechanisms
Defines interfaces that encapsulate the
peculiarities of specific storage systems

Policy Neutrality

Structured so that design decisions with
significant performance implications are
exposed to the user
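
One way to picture policy neutrality is a mechanism that accepts the policy as an argument instead of hard-coding it. The sketch below is illustrative only: `plan_replication`, `default_policy`, and the access-count threshold are hypothetical names and values, not part of the Data Grid design.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical signature: a policy maps (file name, per-site access counts)
# to the list of sites that should receive a copy.
Policy = Callable[[str, Dict[str, int]], List[str]]


def default_policy(filename: str, access_counts: Dict[str, int]) -> List[str]:
    """Assumed default: replicate to sites that read the file at least 3 times."""
    return sorted(site for site, count in access_counts.items() if count >= 3)


def plan_replication(
    filename: str,
    access_counts: Dict[str, int],
    policy: Policy = default_policy,
) -> List[Tuple[str, str]]:
    """Mechanism only: ask the policy where to copy, never decide here."""
    return [(filename, site) for site in policy(filename, access_counts)]
```

An application that prefers a different trade-off can pass its own function as `policy` without touching the mechanism.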

Compatibility with Grid Infrastructure


Takes advantage of fundamental Grid
infrastructure
Compatible with lower-level Grid
mechanisms

Uniformity of Information Infrastructure

The same data model and interface are used
to access all of the grid's metadata



These four principles lead us to the
development of a layered architecture.
Lower layers provide high-performance
access to an orthogonal set of basic mechanisms.
In data grids, the focus on simple, policy-independent
mechanisms will encourage
and enable wide use without limiting the
range of applications that can be implemented.
Core Grid Data Services

Two fundamental services are required in
a data grid architecture:


Data Access
Metadata Access
Data Access

Provides mechanisms for accessing,
managing, and initiating third-party
transfers of data stored in storage
systems
Metadata Access

Provides mechanisms for accessing
and managing information about data
stored in storage systems
Data Abstraction:
Storage System

The basic data grid component is the storage system, which
provides functions for creating, destroying, reading,
writing, and manipulating file instances

File instances are the basic unit of information in a
storage system

A storage system can be implemented by any storage
technology that can support the required access
functions
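
The storage system abstraction can be sketched as an interface that any backing technology may implement. The class and method names below (`StorageSystem`, `InMemoryStorage`, `create`, `destroy`, `read`, `write`) are assumptions chosen to mirror the functions listed on this slide; they are not an API defined by the Data Grid.

```python
from abc import ABC, abstractmethod
from typing import Dict


class StorageSystem(ABC):
    """Hypothetical interface: the access functions a storage system must support."""

    @abstractmethod
    def create(self, name: str) -> None: ...

    @abstractmethod
    def destroy(self, name: str) -> None: ...

    @abstractmethod
    def read(self, name: str) -> bytes: ...

    @abstractmethod
    def write(self, name: str, data: bytes) -> None: ...


class InMemoryStorage(StorageSystem):
    """Toy backend: file instances live in a dictionary."""

    def __init__(self) -> None:
        self._files: Dict[str, bytes] = {}

    def _require(self, name: str) -> None:
        if name not in self._files:
            raise FileNotFoundError(name)

    def create(self, name: str) -> None:
        if name in self._files:
            raise FileExistsError(name)
        self._files[name] = b""

    def destroy(self, name: str) -> None:
        self._require(name)
        del self._files[name]

    def read(self, name: str) -> bytes:
        self._require(name)
        return self._files[name]

    def write(self, name: str, data: bytes) -> None:
        self._require(name)
        self._files[name] = bytes(data)
```

A disk, tape archive, or database backend would implement the same four methods, so code written against `StorageSystem` does not depend on the underlying technology.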
Data Access:



Storage system access functions must be
integrated with the security environment of
each site to which remote access is required
Applications should be able to provide
storage systems with hints concerning
access patterns, network performance, etc.,
that the storage system can use to optimize
performance
Data movement functions must be able to
detect and report errors
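
The hint and error-reporting requirements above can be illustrated with a small copy routine. This is a sketch under stated assumptions: the `hints` dictionary, its `block_size` key, and the `TransferError` exception are hypothetical, and a SHA-256 checksum stands in for whatever error detection a real data mover would use.

```python
import hashlib
import io
from typing import BinaryIO, Dict, Optional


class TransferError(Exception):
    """Raised when the data at the destination does not match what was sent."""


def transfer(src: BinaryIO, dst: BinaryIO, hints: Optional[Dict[str, int]] = None) -> str:
    """Copy src to an initially empty, seekable dst and verify the copy.

    The application may pass a hint such as {"block_size": 4096}; the
    routine uses it if present and falls back to a default otherwise.
    Returns the hex digest of the transferred data.
    """
    hints = hints or {}
    block_size = hints.get("block_size", 64 * 1024)

    sent = hashlib.sha256()
    while True:
        block = src.read(block_size)
        if not block:
            break
        sent.update(block)
        dst.write(block)

    # Detect errors: re-read the destination and compare checksums.
    dst.seek(0)
    received = hashlib.sha256(dst.read()).hexdigest()
    if received != sent.hexdigest():
        raise TransferError("checksum mismatch after transfer")
    return received
```

For example, `transfer(io.BytesIO(b"payload"), io.BytesIO(), {"block_size": 2})` copies the bytes two at a time and raises `TransferError` if the destination content differs from the source.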
Metadata



Concerned with managing information about
the data grid itself
Covers information about file instances, the
contents of file instances, and the
various storage systems contained in
the grid
The metadata service provides a
way to publish and access this metadata
Application Metadata

Describes the contents and structure
of the data



Content represented by the file
Circumstances under which the data was
obtained
Other information useful to applications that
process the data
Replica Metadata


Used to manage replication of data
objects
Includes information for mapping file
instances to particular storage
system locations
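
The mapping described above can be sketched as a small catalog keyed by a logical file name. The `ReplicaCatalog` class, its method names, and the (storage system, path) tuple layout are assumptions for illustration, not the catalog format used by the Data Grid.

```python
from collections import defaultdict
from typing import DefaultDict, List, Set, Tuple

Location = Tuple[str, str]  # (storage system name, path on that system)


class ReplicaCatalog:
    """Hypothetical replica metadata: logical file name -> known locations."""

    def __init__(self) -> None:
        self._locations: DefaultDict[str, Set[Location]] = defaultdict(set)

    def register(self, logical_name: str, storage_system: str, path: str) -> None:
        """Record that a copy of the file exists at the given location."""
        self._locations[logical_name].add((storage_system, path))

    def unregister(self, logical_name: str, storage_system: str, path: str) -> None:
        """Forget a location; unknown locations are ignored."""
        if logical_name in self._locations:
            self._locations[logical_name].discard((storage_system, path))

    def locate(self, logical_name: str) -> List[Location]:
        """Return all known locations in a stable, sorted order."""
        return sorted(self._locations.get(logical_name, ()))
```

Registering the same file on two storage systems and calling `locate` returns both entries, which is the information a replica selection step would start from.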
System Configuration Metadata


Describes the fabric of the grid itself,
e.g., network connectivity and details
about storage systems


Capacity
Usage policy
Additional Requirements




Service must operate efficiently in a
distributed environment
Scalable
Robust
Lets sites assert local control over their information
Hierarchical Distributed System

Because of these requirements, the metadata
service must be a hierarchical and distributed
system



Achieve scalability
Avoid single points of failure
Facilitate local control over data
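
A minimal sketch of the hierarchical idea: each node stands for one metadata server that owns its local entries and delegates everything else to child servers. `MetadataNode`, the path-as-list convention, and the site and attribute names in the usage note are hypothetical; a real deployment would put the nodes on separate machines.

```python
from typing import Dict, List


class MetadataNode:
    """Hypothetical metadata server in a hierarchy of servers."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.entries: Dict[str, object] = {}            # locally controlled metadata
        self.children: Dict[str, "MetadataNode"] = {}   # delegated subtrees

    def add_child(self, node: "MetadataNode") -> "MetadataNode":
        self.children[node.name] = node
        return node

    def lookup(self, path: List[str]) -> object:
        """Resolve a path: the last element is an entry, earlier ones are children."""
        head, *rest = path
        if not rest:
            return self.entries[head]
        return self.children[head].lookup(rest)
```

For example, after `site = root.add_child(MetadataNode("site_a"))` and `site.entries["capacity_tb"] = 500`, the call `root.lookup(["site_a", "capacity_tb"])` is answered by the `site_a` node, so each site keeps control of its own entries and no single server holds everything.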
Higher-Level Data Grid Components

Two types of representative
components:


Replica management
Replica selection
Replica Management




Replica Manager
Create copies of file instances, or
replicas, within specified storage
systems
Replicas offer better performance or
availability for access to or from a
particular location
Maintains a repository, or catalog, of replicas
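
The two duties listed above, copying a file instance and keeping the catalog current, can be sketched together. Everything here is an assumption for illustration: `create_replica` is a hypothetical function, storage systems are modelled as plain dictionaries, and the catalog is a dictionary from file name to a list of storage system names.

```python
from typing import Dict, List

Stores = Dict[str, Dict[str, bytes]]   # storage system name -> {file name: data}
Catalog = Dict[str, List[str]]         # file name -> storage systems holding a copy


def create_replica(
    stores: Stores, catalog: Catalog, name: str, source: str, target: str
) -> bool:
    """Copy `name` from source to target and record the new replica.

    Returns False if the target already holds a replica, True if one was created.
    """
    if source not in catalog.get(name, []):
        raise LookupError(f"{name} has no registered replica on {source}")
    if target in catalog[name]:
        return False
    stores[target][name] = stores[source][name]  # the copy itself
    catalog[name].append(target)                 # keep the catalog in step
    return True
```

Updating the catalog in the same call as the copy keeps the two from drifting apart in this toy model; a distributed implementation would also need to handle a copy that fails partway.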
Replica Selection and Data Filtering

Replica selection is a high-level service
provided in the data grid

Chooses the replica that optimizes a desired performance criterion:




Speed
Cost
Security
Replicas may be local or accessed
remotely
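
Replica selection against one of the criteria above can be sketched as a filter followed by a ranking. The `Replica` fields, the `criterion` strings, and the treatment of security as an "encrypted storage required" filter are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Replica:
    location: str
    bandwidth_mbps: float   # higher is faster
    cost_per_gb: float      # lower is cheaper
    encrypted: bool         # stand-in for a security property


def select_replica(
    replicas: Iterable[Replica],
    criterion: str = "speed",
    require_encrypted: bool = False,
) -> Replica:
    """Pick the replica that best satisfies the requested criterion."""
    candidates = [r for r in replicas if r.encrypted or not require_encrypted]
    if not candidates:
        raise LookupError("no replica satisfies the security requirement")
    if criterion == "speed":
        return max(candidates, key=lambda r: r.bandwidth_mbps)
    if criterion == "cost":
        return min(candidates, key=lambda r: r.cost_per_gb)
    raise ValueError(f"unknown criterion: {criterion}")
```

Because the criterion is an argument, the same mechanism serves an application that wants the fastest copy and one that wants the cheapest, whether the chosen replica is local or remote.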
Summary

Architecture of the Data Grid

Mechanism Neutrality
Policy Neutrality
Compatibility with Grid Infrastructure
Uniformity of Information Infrastructure

Data Services

Data Access
Metadata Access
Replica Management