NPACI Neuroscience

Storage Resource Broker
Federating Archives in the
DELAMAN Network
Reagan W. Moore
San Diego Supercomputer Center
[email protected]
http://www.npaci.edu/DICE/SRB
Distributed Data Management
Using Data Grids
• Build a shared collection
• Authenticate users independently of the storage
systems
• Control access independently of the storage
systems
• Organize the file name space independently of
the storage systems
• Manage context (metadata) independently of
content (files)
• Maintain consistency between context and
operations on content
Storage Resource Broker
• Generic distributed data management
technology
• Data grids - sharing
• Digital libraries - publication
• Persistent archives - preservation
• Federated server architecture / thin client
• 250,000 lines of “C” code
• Supports all major compute and storage platforms
• All requirements listed on following Scenario
slides are supported
Scenario 1 - Data Migration
• Provide URIDs (logical file names) that are
independent of storage system
• Provide metadata for each file
• Support browse and discovery on collection
hierarchy
• Support access interfaces to the data
• Support registration of existing files into a
shared collection
• Single sign-on environment
• GSI / challenge response / tickets
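As an illustration of the challenge-response option listed above, here is a minimal Python sketch. It is not the SRB protocol: the function names and the HMAC-SHA256 construction are assumptions chosen only to show how a client can prove knowledge of a secret without sending it over the network.

```python
import hashlib
import hmac
import secrets


def make_challenge() -> bytes:
    """Server side: a fresh random value for this login attempt."""
    return secrets.token_bytes(16)


def respond(shared_secret: bytes, challenge: bytes) -> str:
    """Client side: prove knowledge of the secret without sending it."""
    return hmac.new(shared_secret, challenge, hashlib.sha256).hexdigest()


def verify(shared_secret: bytes, challenge: bytes, response: str) -> bool:
    """Server side: recompute the expected response, compare in constant time."""
    expected = respond(shared_secret, challenge)
    return hmac.compare_digest(expected, response)


challenge = make_challenge()
ok = verify(b"s3cret", challenge, respond(b"s3cret", challenge))
```

Because the data grid performs this check itself, the user authenticates once to the grid rather than to each storage system.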
Managing Distributed Data
Data Access Methods (Web Browser, DSpace, OAI-PMH)
Storage Repository
• Storage location
• User name
• File name
• File context (creation date,…)
• Access constraints
Naming conventions
provided by storage
systems
Data Grids Provide a Level of Indirection
for Each Naming Convention
Data Access Methods (C library, Unix, Web Browser)
Data Collection
Storage Repository
Data Grid
• Storage location
• Logical resource name space
• User name
• Logical user name space
• File name
• Logical file name space (URID)
• File context (creation date,…)
• Logical context (metadata)
• Access constraints
• Control/consistency constraints
Data is organized as a shared collection
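The level of indirection can be sketched as a small catalog that maps each logical file name to its physical replicas. This is a hypothetical Python illustration, not the SRB API: `LogicalCatalog`, `Replica`, the resource names, and the paths are all invented for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Replica:
    resource: str       # logical resource name (made up for the example)
    physical_path: str  # the name the storage system itself uses


@dataclass
class LogicalCatalog:
    # logical file name -> list of physical replicas
    entries: dict = field(default_factory=dict)

    def register(self, logical_name, resource, physical_path):
        """Record where a copy of the logical file physically lives."""
        self.entries.setdefault(logical_name, []).append(
            Replica(resource, physical_path))

    def resolve(self, logical_name):
        """Return every physical replica behind a logical name."""
        return list(self.entries.get(logical_name, []))

    def migrate(self, logical_name, old_resource, new_resource, new_path):
        """Move a replica; the logical name seen by users does not change."""
        self.entries[logical_name] = [
            Replica(new_resource, new_path) if r.resource == old_resource else r
            for r in self.entries[logical_name]
        ]


catalog = LogicalCatalog()
catalog.register("/delaman/audio/rec001.wav", "disk-a", "/vol1/x7/rec001.wav")
catalog.migrate("/delaman/audio/rec001.wav", "disk-a", "tape-b", "/arch/0001")
```

After the migration, users still ask for `/delaman/audio/rec001.wav`; only the catalog entry changed.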
Provide Context for Data
• Properties of files
• Provenance - source
• Descriptive attributes
• State information resulting from operations on files
• Organize properties as metadata in a collection hierarchy
• Define operations on file properties
• Manage state information - location, replicas, containers, checksums
• Separate context management from content management
• Maintain consistency of context as operations are done on content
• Support context management
• Schema extension, automated SQL generation, bulk metadata load
• Metadata extraction through a remote procedure parsing the file
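Keeping context consistent with operations on content can be sketched as follows. This is a simplified Python illustration under assumed names (`Collection`, `put`, `remove`): every operation on the file bytes updates the state information in the same step, while descriptive attributes are stored separately from the content.

```python
class Collection:
    """Keeps content (bytes) and context (metadata) in separate stores."""

    def __init__(self):
        self.content = {}  # logical name -> file bytes
        self.context = {}  # logical name -> metadata dict

    def put(self, name, data, **descriptive):
        """Store content and refresh the state information derived from it."""
        self.content[name] = data
        meta = self.context.setdefault(name, {"version": 0})
        meta.update(descriptive)   # provenance and descriptive attributes
        meta["size"] = len(data)   # state information from the operation
        meta["version"] += 1

    def remove(self, name):
        """Deleting content also deletes its context, so neither is orphaned."""
        del self.content[name]
        del self.context[name]


coll = Collection()
coll.put("notes.txt", b"hello", source="field recording 12")
coll.put("notes.txt", b"hi")
```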
Federated Server Architecture
[Figure: a read application submits a logical name or an attribute condition to an SRB server. SRB servers perform peer-to-peer brokering and consult the MCAT; server(s) spawn SRB agents, which provide parallel data access to storage resources R1 and R2.]
MCAT steps:
1. Logical-to-physical mapping
2. Identification of replicas
3. Access & audit control
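The three MCAT steps named in the figure can be sketched in Python as one function. This is an assumption-laden illustration: the dictionary layout, the `online` flag, and the name `broker_read` are hypothetical and do not reflect the real MCAT schema.

```python
def broker_read(mcat, user, logical_name):
    """Resolve a read request to one physical replica."""
    # 1. Logical-to-physical mapping
    entry = mcat["files"].get(logical_name)
    if entry is None:
        raise KeyError(f"unknown logical name: {logical_name}")
    # 2. Identification of replicas (here: keep only those marked online)
    replicas = [r for r in entry["replicas"] if r["online"]]
    # 3. Access & audit control: every attempt is logged, allowed or not
    allowed = user in entry["readers"]
    mcat["audit"].append((user, logical_name, "read", allowed))
    if not allowed:
        raise PermissionError(f"{user} may not read {logical_name}")
    if not replicas:
        raise LookupError(f"no replica of {logical_name} is online")
    return replicas[0]["resource"], replicas[0]["path"]


mcat = {
    "files": {
        "/home/demo/a.dat": {
            "readers": {"alice"},
            "replicas": [
                {"resource": "R1", "path": "/r1/a", "online": False},
                {"resource": "R2", "path": "/r2/a", "online": True},
            ],
        }
    },
    "audit": [],
}
location = broker_read(mcat, "alice", "/home/demo/a.dat")
```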
Storage Resource Broker - Data Grid
Application
• C, C++, Java Libraries
• Linux I/O
• Unix Shell
• Java, NT Browser
• Kepler Actors
• DLL / Python, Perl
• HTTP, DSpace, OpenDAP
• OAI, WSDL, WSRF
Federation Management
Consistency & Metadata Management / Authorization, Authentication, Audit
• Logical Name Space
• Latency Management
• Data Transport
• Metadata Transport
Catalog Abstraction
• Databases: DB2, Oracle, Sybase, Postgres, mySQL, Informix
Storage Repository Virtualization
• Archives: Tape, Sam-QFS, DMF, HPSS, ADSM, UniTree, ADS
• ORB
• File Systems: Unix, NT, Mac OSX
• Databases: DB2, Oracle, Sybase, SQLserver, Postgres, mySQL, Informix
Scenario 2 - Data Exchange
• Support access controls on the URIDs
• Java administration GUI to support owner control of
access controls
• Can delegate permission to set access controls
• Access controls apply on all replicas independent of
storage system
• Support latency management for moving files
across wide area networks
• Parallel I/O, replication, staging, aggregation of data /
metadata / I/O commands
• Support integrity validation
• Manage checksums for each file
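Integrity validation with per-file checksums can be sketched as below. This is a minimal Python example; the choice of SHA-256 and the function names are assumptions, and the slides do not say which checksum algorithm the SRB uses.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Checksum recorded in the catalog when the file is registered."""
    return hashlib.sha256(data).hexdigest()


def corrupted_replicas(expected: str, replicas: dict) -> list:
    """Return the resources whose copy no longer matches the recorded checksum."""
    return sorted(
        resource for resource, data in replicas.items()
        if checksum(data) != expected
    )


original = b"seismogram samples"
recorded = checksum(original)
bad = corrupted_replicas(recorded, {
    "disk-a": original,
    "tape-b": b"seismogram sampleX",  # simulated bit rot
})
```

Because the checksum lives in the catalog rather than on the storage system, the same check applies to every replica regardless of where it is stored.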
Latency Management - Bulk Operations
• Bulk register
• Create a logical name for a file
• Bulk load
• Create a copy of the file on a data grid storage repository
• Bulk unload
• Provide containers to hold small files and pointers to each file location
• Bulk delete
• Mark as deleted in metadata catalog
• After specified interval, delete file
• Bulk metadata load
• Support parsing of metadata from a remote file at remote storage
• Requests for bulk operations for access control setting,
…
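The container idea above (hold many small files in one object, with a pointer to each file's location) can be sketched as follows. This is a hypothetical Python illustration; `pack` and `extract` are invented names, and the real SRB container format is not described in these slides.

```python
import io


def pack(files: dict):
    """Concatenate small files into one container.

    Returns the container bytes and an index of (offset, length) pointers.
    """
    buffer = io.BytesIO()
    index = {}
    for name, data in files.items():
        index[name] = (buffer.tell(), len(data))
        buffer.write(data)
    return buffer.getvalue(), index


def extract(container: bytes, index: dict, name: str) -> bytes:
    """Read one file back out of the container using its pointer."""
    offset, length = index[name]
    return container[offset:offset + length]


container, index = pack({"a.txt": b"alpha", "b.txt": b"beta", "c.txt": b""})
```

Moving one container across a wide area network costs one round trip instead of one per small file, which is the latency saving the slide refers to.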
Scenario 3 - Community Access
• Within the shared collection, the digital entities are
owned and managed by the data grid
• Files, URLs, SQL commands, database binary large objects can
be registered into the shared collection
• Access controls for
• Files / metadata / storage systems
• Access controls are defined for multiple roles
• Schema extension, create new metadata
• Modify metadata
• Add annotations
• Turn on audit trails
• Write data
• Read data
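The roles listed above can be modelled as independent permission flags granted per user and per digital entity. This is a sketch under assumptions: the slides do not say whether the SRB treats the roles as a hierarchy or as independent grants, and the names below are invented for illustration.

```python
from enum import Flag, auto


class Permission(Flag):
    READ = auto()
    WRITE = auto()
    AUDIT = auto()            # turn on audit trails
    ANNOTATE = auto()
    MODIFY_METADATA = auto()
    EXTEND_SCHEMA = auto()    # schema extension, create new metadata


acl = {}  # (logical name, user) -> combined Permission flags


def grant(name, user, permission):
    """Add a permission to whatever the user already holds on this entity."""
    acl[(name, user)] = acl.get((name, user), Permission(0)) | permission


def allowed(name, user, permission):
    """True when the user holds every flag in `permission` on this entity."""
    return permission in acl.get((name, user), Permission(0))


grant("/coll/video01", "alice", Permission.READ | Permission.ANNOTATE)
grant("/coll/video01", "bob", Permission.READ)
```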
Scenario 4 - Explorative Studies
• Uniform access mechanisms to data across
all storage systems
• Support for queries on databases
• Support for formatting results (XML, HTML)
• Support audit trails, encryption
• Support user-defined collection hierarchy
• Soft links (build a logical collection of pointers to data
within the data grid)
• Support for multiple types of discovery
• By URID (Logical File Name)
• By query on metadata (may be unique to a single file)
• By GUID (handle system)
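Discovery by query on metadata can be sketched as a filter over the catalog. This is a minimal Python illustration with made-up attribute names; a real deployment would issue the query against the metadata database instead of an in-memory dictionary.

```python
def find(catalog: dict, **conditions) -> list:
    """Return logical names whose metadata matches every given attribute."""
    return sorted(
        name for name, meta in catalog.items()
        if all(meta.get(key) == value for key, value in conditions.items())
    )


catalog = {
    "/lang/kayardild/rec01.wav": {"language": "Kayardild", "year": 1998},
    "/lang/kayardild/rec02.wav": {"language": "Kayardild", "year": 2001},
    "/lang/yeli/rec01.wav": {"language": "Yeli Dnye", "year": 2001},
}
by_language = find(catalog, language="Kayardild")
unique_hit = find(catalog, language="Kayardild", year=2001)
```

A sufficiently specific query returns a single file, which is why the slide notes that a metadata query may be unique to one file.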
Scenario 5 - Education
• SRB is used to build digital libraries
• Assemble class material
• Manage student reports
• Display material through web browsers
• Federation of digital libraries
• Controlled sharing across independent data grids or
digital libraries
• Support for cross-registration of logical name spaces
• Authentication done by “home” data grid
• Access controls managed by both data grids
Federation
Data Access Methods (Web Browser, DSpace, OAI-PMH)
Data Collection A (Data Grid) and Data Collection B (Data Grid) each maintain their own:
• Logical resource name space
• Logical user name space
• Logical file name space
• Logical context (metadata)
• Control/consistency constraints
Access controls and consistency constraints
on cross registration of digital entities
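Federation with authentication by the "home" data grid can be sketched as follows. This is a simplified, hypothetical Python model: the `Zone` class, the plain-text secrets, and the `user@zone` naming are assumptions for illustration, and the sketch checks only the access controls of the grid that holds the data.

```python
class Zone:
    """One independent data grid in a federation."""

    def __init__(self, name):
        self.name = name
        self.users = {}         # local user -> secret (stand-in for credentials)
        self.remote_zones = {}  # cross-registered zone name -> Zone
        self.acl = {}           # logical name -> set of "user@zone" allowed

    def authenticate(self, user, secret):
        return self.users.get(user) == secret

    def read(self, user, home, secret, logical_name):
        """Serve a read; the user's home zone decides whether they are genuine."""
        home_zone = self if home == self.name else self.remote_zones.get(home)
        if home_zone is None or not home_zone.authenticate(user, secret):
            raise PermissionError("authentication failed")
        if f"{user}@{home}" not in self.acl.get(logical_name, set()):
            raise PermissionError("access denied")
        return f"{self.name}:{logical_name}"


grid_a, grid_b = Zone("A"), Zone("B")
grid_a.users["alice"] = "pw"
grid_b.remote_zones["A"] = grid_a            # cross-registration of zone A
grid_b.acl["/B/data.txt"] = {"alice@A"}      # B decides what alice may read
result = grid_b.read("alice", "A", "pw", "/B/data.txt")
```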
Scenario 6 - Updating Resources
• Maintain system level metadata
• Owner of registered file
• Creation time, modification time, size, audit trails
• Replica locations
• Support for synchronization of replicas
• Can modify a replica, subsequent reads are to the
modified copy
• Can synchronize copies to the modified version
• Support for physical file containers
• Aggregate small files before storage
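Replica synchronization as described above can be sketched like this. It is an illustrative Python model with invented names (`ReplicatedFile`, `dirty`): a write to one replica marks the others stale, reads go to the modified copy, and a synchronize step brings the stale copies up to date.

```python
class ReplicatedFile:
    def __init__(self, data, resources):
        self.copies = {resource: data for resource in resources}
        self.current = resources[0]  # replica holding the latest version
        self.dirty = set()           # replicas that are out of date

    def write(self, resource, data):
        """Modify one replica and mark every other replica as stale."""
        self.copies[resource] = data
        self.current = resource
        self.dirty = set(self.copies) - {resource}

    def read(self):
        """Subsequent reads are served from the modified copy."""
        return self.copies[self.current]

    def synchronize(self):
        """Copy the modified version over every stale replica."""
        for resource in self.dirty:
            self.copies[resource] = self.copies[self.current]
        self.dirty.clear()


f = ReplicatedFile(b"v1", ["sdsc", "umd"])
f.write("umd", b"v2")
stale_before_sync = set(f.dirty)
f.synchronize()
```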
Scenario 7 - Web-based Editions
• Support for digital library interfaces on top of
the data grid
• Transana - technology to manipulate, edit, and
manage classroom video (University of Wisconsin)
• DSpace - digital library system to manage ingestion of
material into a collection
• OAI-PMH - Open Archives Initiative protocol for
metadata harvesting
• OpenDAP - Data Access Protocol that supports both
semantic and structural manipulation of registered files
• Windows browser, Web browser, Java, WSDL
interfaces
• Collaborating on development of portlet interface
Scenario 8 - Unconnected Editions
• Ability to download data from shared
collection to local resource
• Support for PCs, workstations,
supercomputers
• Generalization of anonymous FTP
• Can issue a ticket permitting
• Limited number of read accesses valid for specified
time interval
• Can set public access to a sub-collection
• Can restrict access by user
name/domain/zone
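A ticket that permits a limited number of reads within a time interval can be sketched as follows. This is a hypothetical Python model, not the SRB ticket format; the class name, the random token, and the optional `now` parameter (used so the example does not depend on the wall clock) are assumptions.

```python
import secrets
import time


class Ticket:
    def __init__(self, logical_name, max_reads, valid_seconds, now=None):
        issued = time.time() if now is None else now
        self.token = secrets.token_hex(16)  # value handed to the recipient
        self.logical_name = logical_name
        self.reads_left = max_reads
        self.expires = issued + valid_seconds

    def redeem(self, now=None):
        """Consume one read if the ticket is still valid; report success."""
        now = time.time() if now is None else now
        if now > self.expires or self.reads_left <= 0:
            return False
        self.reads_left -= 1
        return True


ticket = Ticket("/public/report.pdf", max_reads=2, valid_seconds=60, now=1000.0)
results = [ticket.redeem(now=1010.0), ticket.redeem(now=1020.0),
           ticket.redeem(now=1030.0)]
```

The holder of the token needs no account on the data grid, which is what makes this a generalization of anonymous FTP.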
Local Archives
• Maintain files in local file system
• Register existence of the files into the data
grid
• Issue synchronization command to replicate
into the archive
• Maintain a data grid on the local system
• Entire environment can be installed on a Mac
in 15 minutes (Perl install script)
• Use data grid federation to synchronize name
spaces, files, metadata from local data grid to
archives data grid
Scenario 9 - Collaborative Commentary
• Comments can be added by owner
• Annotations can be added by authorized
persons
• Annotations marked by person name, date
• Can restrict annotation right by group
• Can choose to create explicit metadata
attributes to manage comments
• Can store multiple comments per object
• Can search across metadata
• Or can use digital library interfaces to
manage comments
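Annotations marked with person name and date, restricted by group, can be sketched as below. This is an illustrative Python model with invented names; it stores several annotations per object and supports a simple text search across them.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Annotation:
    author: str
    written: date
    text: str


class AnnotatedObject:
    def __init__(self, allowed_groups):
        self.allowed_groups = set(allowed_groups)  # groups with annotation rights
        self.annotations = []                      # multiple comments per object

    def annotate(self, author, group, text, written):
        """Attach an annotation if the author's group holds the right."""
        if group not in self.allowed_groups:
            raise PermissionError(f"group {group!r} may not annotate")
        self.annotations.append(Annotation(author, written, text))

    def search(self, word):
        """Case-insensitive search across all annotation texts."""
        return [a for a in self.annotations if word.lower() in a.text.lower()]


obj = AnnotatedObject(allowed_groups={"linguists"})
obj.annotate("alice", "linguists", "Speaker switches dialect here", date(2004, 11, 2))
obj.annotate("bob", "linguists", "Audio clipping at 02:10", date(2004, 11, 3))
hits = obj.search("dialect")
```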
Sites Using the SRB
Academia Sinica, Taiwan
ASCC, Computing Centre, Taiwan
Australian National University
Bedford Oceanography, Canada
Bioinformatics Institute, Singapore
CSIRO, Australia
Data Storage Institute, Singapore
EGEE, French National Center
GeoForschungsZentrum, Germany
James Cook University, Australia
KEK High Energy Physics, Japan
Max Planck Institute, Netherlands
Parallab, Norway
South Australian Advanced Computing
UIB (Parallab), Norway
University of Amsterdam
University of Cambridge, Astronomy
University of Cambridge, e-Science
University of Edinburgh
University of Genoa, Italy
University of Hong Kong
University of Manchester
University of Oslo
University of Southampton
York Univ (UK)
CiteSeer, Penn State
City Univ. of New York
Geospatial Environment, UCSD
Drexel University
EOSDIS Distributed Active, NASA Goddard
Georgia Tech
Kentucky State Libraries & Archives
Library of Congress
Los Alamos National Lab
NASA Ames
NASA Goddard Space Flight Center
NCSA Grid Computing
NIH (NCI Center for Bioinformatics)
Penn State University
Pittsburgh Supercomputing Center
Purdue University, Indiana
Stanford University
TACC, University of Texas
Texas A & M
UC Santa Cruz
UCLA
UCSD Neuroscience
University of Maryland
University of Michigan, CAC department
University of New Mexico
University of Washington
University of Wisconsin
USC
Yale University
Storage Resource Broker Collections at SDSC (11/2/2004)

Collection                                                              GBs of data stored  Number of files  Number of users
Data Grid
  NSF/ITR - National Virtual Observatory                                            53,858        9,536,698               80
  NSF - National Partnership for Advanced Computational Infrastructure              24,738        5,754,890              380
  Hayden Planetarium - Evolution of the Solar System visualizations                  7,201          113,600              178
  NSF/NPACI - Joint Center for Structural Genomics                                   5,228          652,031               50
  NSF/NPACI - Biology and Environmental collections                                  8,851           33,340               67
  NSF - TeraGrid, ENZO Cosmology simulations                                       121,550        1,096,947            3,247
  NIH - Biomedical Informatics Research Network                                      6,002        4,107,508              214
Digital Library
  NLM - Digital Embryo image collection                                                720           45,365               23
  NSF/NPACI - Long Term Ecological Reserve                                             253            8,436               36
  NSF/NPACI - Grid Portal                                                            2,211           51,227              407
  NIH - Alliance for Cell Signaling microarray data                                    856           62,291               21
  NSF - National Science Digital Library SIO Explorer collection                     2,080          808,901               27
  NSF/NPACI - Transana education research video collection                              92            2,387               26
  NSF/ITR - Southern California Earthquake Center                                   91,040        1,791,494               62
Persistent Archive
  UCSD Libraries archive                                                               128          204,828               29
  NARA - Research Prototype Persistent Archive                                         166          316,813               58
  NSF - National Science Digital Library persistent archive                          3,571       26,908,350              122
TOTAL                                                                               328 TB       51 million            4,900
Generic Infrastructure
• SDSC developed the Storage Resource
Broker (SRB) to support access to distributed
data
• Effort started in 1996 as a DARPA funded project
• Now supports over 30 national/international projects
• Development team of 12 staff is led by
• Michael Wan, data management systems
• Arcot Rajasekar , information management systems
SDSC SRB Team (left to right)
• Arun Jagatheesan
• George Kremenek
• Sheau-Yen Chen
• Arcot Rajasekar (SRB development lead)
• Reagan Moore (SRB PI)
• Michael Wan (SRB architect)
• Roman Olschanowsky (BIRN)
• Bing Zhu
• Charlie Cowart
• Lucas Gilbert
• Tim Warnock
• Wayne Schroeder (SRB product)
• Adam Birnbaum (SRB production)
• Antoine De Torcy
• Vicky Rowley (BIRN)
• Marcio Faerman (SCEC)
• Students & emeritus
• Erik Vandekieft
• Reena Mathew
• Xi (Cynthia) Sheng
• Allen Ding
• Grace Lin
• Qiao Xin
• Daniel Moore
• Ethan Chen
• Jon Weinburg
• Supported by over 20 projects (NSF, DOE, NASA, NARA, NIH, LOC, NHPRC)
Data Grid Capabilities
• Data manipulation
• Containers
• Parallel I/O
• Firewall interactions
• Resource interactions
• Fault tolerance
• Load leveling
• Replication
• HIPAA security requirements
• Authentication of all users
• Access controls on data and metadata
• Audit trails
• Data encryption
• Centralized control
• Application interfaces
• C library, Shell commands, Java, Perl, Python, WSDL, workflow
Data Management System Features
• Data grid for managing distributed data
• Latency management for bulk analyses of collections
• Infrastructure independent name spaces for describing
data, resources, users, and state information
• Digital library for managing data context
• Curation services for managing collections
• Descriptive metadata for discovery
• Persistent archive to manage technology
evolution
• Interoperability mechanisms between heterogeneous
storage systems and user access mechanisms
BIRN - Biomedical Informatics Research
Network Data Grid
[Figure: network of BIRN sites (Wash U., Duke, Cal Tech, NPACI/SDSC, UCLA, Harvard, Cal-(IT)2) and NIH/NCRR Centers for Imaging and Computing, spanning the “Deep Web” and “Surface Web”, reached through a web interface and a wireless “Pad”]
Integrating Cyber Infrastructure to Link:
• Advanced Imaging Instruments
• Data Intensive Computing
• Multi-Scale Brain Databases
Digital Library
• Collection hierarchy for organizing data
• User-defined metadata
• Collection level metadata
• Metadata manipulation
• Schema extension
• Bulk metadata processing
• Queries on metadata
• Access controls on metadata
• Views on collections
• Digital library APIs
• DSpace, Fedora, OAI-PMH, web browsers
• METS metadata XML schema
Southern California Earthquake Center
Store seismic data
• Managing over 90 TBs, over 1.7 million files
• Store community models for seismic velocity
• Data distributed between USC, SDSC
SCEC community digital library
• Storage Resource Broker data grid technology
• NMI portal interface
• Digital library services to display seismograms
• Visualizations of seismic waves at the surface
• Visualization of seismic wave propagation through the volume
[Figure: SCEC Community Library interface: Select Receiver (Lat/Lon); Select Scenario (Fault Model, Source Model); Output Time History Seismograms]
National Virtual Observatory
Provide access to large star catalogs and large image sky surveys
Virtual Observatory Architecture
[Figure: layered architecture spanning Discover, Compute, Publish, Collaborate]
• Portals, User Interfaces, Tools: VOPlot, Topcat, SkyQuery, DIS, Aladin, conVOT, OASIS, Mirage; crossmatch, visualization, image, source detection, data mining
• Registry Layer: HTTP Services, SOAP Services, Grid Services; OpenSkyQuery, OAI, ADS, Digital Library, other registries; XML, DC, METS
• Data Services: Bulk Access, Semantics (UCD), SIAP, SSAP; interfaces to data that are stateless, registered, self-describing, persistent, authenticated
• Existing Data Centers: 2MASS, SDSS, DPOSS, USNO-B, Macho
• Compute Services: Virtual Data, Workflow (pipelines), Authentication & Authorization, My Space storage services
• Grid Middleware: SRB, Globus, OGSA; SOAP, GridFTP
• Databases, Persistency, Replication
• Disks, Tapes, CPUs, Fiber
National Science Digital Library
Preserve educational material
that has been registered into a
central repository at Cornell
through URLs
• Crawl web and retrieve
material, 10 levels of
indirection
• Convert internal URLs into
data grid handles
• Aggregate files into
containers for storage
• Preserve using SRB data
grid technology
• Currently housing over 26
million files
Web Interface to
Persistent Archive
National Archives and Records Administration Research Prototype Persistent Archive
Demonstrate preservation
environment
• Authenticity
• Integrity
• Management of
technology evolution
• Mitigation of risk of data loss
• Replication of data
• Federation of catalogs
• Management of preservation
metadata
• Scalability
• EAP collection
• 350,000 files
• 1.2 TBs in size
Federation of Three
Independent Data Grids
NARA
MCAT
Principal copy
stored at NARA
with complete
metadata catalog
U Md
MCAT
Replicated copy
at U Md for improved
access, load balancing
and disaster recovery
SDSC
MCAT
Deep Archive at
SDSC, no user
access, but
complete copy
For More Information
Reagan W. Moore
San Diego Supercomputer Center
[email protected]
http://www.npaci.edu/DICE
http://www.npaci.edu/DICE/SRB
http://www.npaci.edu/dice/srb/mySRB/mySRB.html