Storage Resource Broker
Federating Archives in the DELAMAN Network

Reagan W. Moore
San Diego Supercomputer Center
[email protected]
http://www.npaci.edu/DICE/SRB

Distributed Data Management Using Data Grids
• Build a shared collection
• Authenticate users independently of the storage systems
• Control access independently of the storage systems
• Organize the file name space independently of the storage systems
• Manage context (metadata) independently of content (files)
• Maintain consistency between context and operations on content

Storage Resource Broker
• Generic distributed data management technology
  • Data grids - sharing
  • Digital libraries - publication
  • Persistent archives - preservation
• Federated server architecture / thin client
• 250,000 lines of “C” code
• Supports all major compute and storage platforms
• All requirements listed on following Scenario slides are supported

Scenario 1 - Data Migration
• Provide URIDs (logical file names) that are independent of storage system
• Provide metadata for each file
• Support browse and discovery on collection hierarchy
• Support access interfaces to the data
• Support registration of existing files into a shared collection
• Single sign-on environment
  • GSI / challenge response / tickets

Managing Distributed Data
Data Access Methods (Web Browser, DSpace, OAI-PMH)
Storage Repository
• Storage location
• User name
• File name
• File context (creation date, …)
• Access constraints
Naming conventions provided by storage systems

Data Grids Provide a Level of Indirection for Each Naming Convention
Data Access Methods (C library, Unix, Web Browser)
Data Collection
Storage Repository                     Data Grid
• Storage location                     • Logical resource name space
• User name                            • Logical user name space
• File name                            • Logical file name space (URID)
• File context (creation date, …)      • Logical context (metadata)
• Access constraints                   • Control/consistency constraints
Data is organized as a shared collection

Provide Context for Data
• Properties
of files
  • Provenance - source
  • Descriptive attributes
  • State information resulting from operations on files
• Organize properties as metadata in a collection hierarchy
• Define operations on file properties
  • Manage state information - location, replicas, containers, checksums
• Separate context management from content management
  • Maintain consistency of context as operations are done on content
• Support context management
  • Schema extension, automated SQL generation, bulk metadata load
  • Metadata extraction through a remote procedure parsing the file

Federated Server Architecture
[Diagram labels: Read Application; Logical Name or Attribute Condition; Peer-to-peer Brokering; Parallel Data Access; SRB server (two); SRB agent (two); MCAT; Server(s) Spawning; Data Access; storage resources R1 and R2; numbered steps 1 to 6]
1. Logical-to-Physical mapping
2. Identification of Replicas
3. Access & Audit Control

Storage Resource Broker - Data Grid
[Layered architecture diagram, top to bottom]
Application
Access interfaces: C, C++, Java Libraries; Linux I/O; Unix Shell; Java, NT Browser; Kepler Actors; DLL / Python, Perl; HTTP, DSpace, OpenDAP; OAI, WSDL, WSRF
Federation Management
Consistency & Metadata Management / Authorization, Authentication, Audit
Logical Name Space; Latency Management; Data Transport; Metadata Transport
Catalog Abstraction
  Databases: DB2, Oracle, Sybase, Postgres, mySQL, Informix
Storage Repository Virtualization
  Databases: DB2, Oracle, Sybase, SQLserver, Postgres, mySQL, Informix
  Archives - Tape, Sam-QFS, DMF, HPSS, ADSM, UniTree, ADS
  ORB
  File Systems: Unix, NT, Mac OSX

Scenario 2 - Data Exchange
• Support access controls on the URIDs
  • Java administration GUI to support owner control of access controls
  • Can delegate permission to set access controls
  • Access controls apply on all replicas independent of storage system
• Support latency management for moving files across wide area networks
  • Parallel I/O, replication, staging, aggregation of data / metadata / I/O commands
• Support integrity validation
  • Manage checksums for each file

Latency Management - Bulk Operations
•
Bulk register
  • Create a logical name for a file
• Bulk load
  • Create a copy of the file on a data grid storage repository
• Bulk unload
  • Provide containers to hold small files and pointers to each file location
• Bulk delete
  • Mark as deleted in metadata catalog
  • After specified interval, delete file
• Bulk metadata load
  • Support parsing of metadata from a remote file at remote storage
• Requests for bulk operations for access control setting, …

Scenario 3 - Community Access
• Within the shared collection, the digital entities are owned and managed by the data grid
• Files, URLs, SQL commands, database binary large objects can be registered into the shared collection
• Access controls for
  • Files / metadata / storage systems
• Access controls are defined for multiple roles
  • Schema extension, create new metadata
  • Modify metadata
  • Add annotations
  • Turn on audit trails
  • Write data
  • Read data

Scenario 4 - Explorative Studies
• Uniform access mechanisms to data across all storage systems
• Support for queries on databases
• Support for formatting results (XML, HTML)
• Support audit trails, encryption
• Support user-defined collection hierarchy
  • Soft links (build a logical collection of pointers to data within the data grid)
• Support for multiple types of discovery
  • By URID (Logical File Name)
  • By query on metadata (may be unique to a single file)
  • By GUID (handle system)

Scenario 5 - Education
• SRB is used to build digital libraries
  • Assemble class material
  • Manage student reports
  • Display material through web browsers
• Federation of digital libraries
  • Controlled sharing across independent data grids or digital libraries
  • Support for cross-registration of logical name spaces
  • Authentication done by “home” data grid
  • Access controls managed by both data grids

Federation
Data Access Methods (Web Browser, DSpace, OAI-PMH)
Data Collection A / Data Grid          Data Collection B / Data Grid
• Logical resource name space          • Logical resource name space
• Logical
user name space                        • Logical user name space
• Logical file name space              • Logical file name space
• Logical context (metadata)           • Logical context (metadata)
• Control/consistency constraints      • Control/consistency constraints
Access controls and consistency constraints on cross registration of digital entities

Scenario 6 - Updating Resources
• Maintain system level metadata
  • Owner of registered file
  • Creation time, modification time, size, audit trails
  • Replica locations
• Support for synchronization of replicas
  • Can modify a replica, subsequent reads are to the modified copy
  • Can synchronize copies to the modified version
• Support for physical file containers
  • Aggregate small files before storage

Scenario 7 - Web-based Editions
• Support for digital library interfaces on top of the data grid
  • Transana - technology to manipulate, edit, and manage classroom video (University of Wisconsin)
  • DSpace - digital library system to manage ingestion of material into a collection
  • OAI-PMH - Open Archives Initiative protocol for metadata harvesting
  • OpenDAP - Data Access Protocol that supports both semantic and structural manipulation of registered files
• Windows browser, Web browser, Java, WSDL interfaces
• Collaborating on development of portlet interface

Storage Resource Broker - Data Grid
[Layered architecture diagram, repeated from the earlier slide of the same title]

Scenario 8 - Unconnected Editions
• Ability to download data from shared collection to local resource
• Support for
PCs, workstations, supercomputers
• Generalization of anonymous FTP
  • Can issue a ticket permitting
    • Limited number of read accesses valid for specified time interval
  • Can set public access to a sub-collection
  • Can restrict access by user name/domain/zone

Local Archives
• Maintain files in local file system
  • Register existence of the files into the data grid
  • Issue synchronization command to replicate into the archive
• Maintain a data grid on the local system
  • Entire environment can be installed on a Mac in 15 minutes (Perl install script)
  • Use data grid federation to synchronize name spaces, files, metadata from local data grid to archives data grid

Scenario 9 - Collaborative Commentary
• Comments can be added by owner
• Annotations can be added by authorized persons
• Annotations marked by person name, date
• Can restrict annotation right by group
• Can choose to create explicit metadata attributes to manage comments
• Can store multiple comments per object
• Can search across metadata
• Or can use digital library interfaces to manage comments

Sites Using the SRB
Academia Sinica, Taiwan
ASCC, Computing Centre, Taiwan
Australian National University
Bedford Oceanography, Canada
Bioinformatics Institute, Singapore
CSIRO, Australia
Data Storage Institute, Singapore
EGEE, French National Center
GeoForschungsZentrum, Germany
James Cook University, Australia
KEK High Energy Physics, Japan
Max Planck Institute, Netherlands
Parallab, Norway
South Australian Advanced Computing
UIB (Parallab), Norway
University of Amsterdam
University of Cambridge, Astronomy
University of Cambridge, e-Science
University of Edinburgh
University of Genoa, Italy
University of Hong Kong
University of Manchester
University of Oslo
University of Southampton
York Univ (UK)
CiteSeer, Penn State
City Univ.
of New York Geospatial Environment, UCSD
Drexel University
EOSDIS Distributed Active, NASA Goddard
Georgia Tech
Kentucky State Libraries & Archives
Library of Congress
Los Alamos National Lab
NASA Ames
NASA Goddard Space Flight Center
NCSA Grid Computing
NIH (NCI Center for Bioinformatics)
Penn State University
Pittsburgh Supercomputing Center
Purdue University, Indiana
Stanford University
TACC, University of Texas
Texas A & M
UC Santa Cruz
UCLA
UCSD Neuroscience
University of Maryland
University of Michigan, CAC department
University of New Mexico
University of Washington
University of Wisconsin
USC
Yale University

Storage Resource Broker Collections at SDSC (11/2/2004)

Collection                                                              GBs of data stored  Number of files  Number of users
Data Grid
  NSF/ITR - National Virtual Observatory                                            53,858        9,536,698               80
  NSF - National Partnership for Advanced Computational Infrastructure              24,738        5,754,890              380
  Hayden Planetarium - Evolution of the Solar System visualizations                  7,201          113,600              178
  NSF/NPACI - Joint Center for Structural Genomics                                   5,228          652,031               50
  NSF/NPACI - Biology and Environmental collections                                  8,851           33,340               67
  NSF - TeraGrid, ENZO Cosmology simulations                                       121,550        1,096,947            3,247
  NIH - Biomedical Informatics Research Network                                      6,002        4,107,508              214
Digital Library
  NLM - Digital Embryo image collection                                                720           45,365               23
  NSF/NPACI - Long Term Ecological Reserve                                             253            8,436               36
  NSF/NPACI - Grid Portal                                                            2,211           51,227              407
  NIH - Alliance for Cell Signaling microarray data                                    856           62,291               21
  NSF - National Science Digital Library SIO Explorer collection                     2,080          808,901               27
  NSF/NPACI - Transana education research video collection                              92            2,387               26
  NSF/ITR - Southern California Earthquake Center                                   91,040        1,791,494               62
Persistent Archive
  UCSD Libraries archive                                                               128          204,828               29
  NARA - Research Prototype Persistent Archive                                         166          316,813               58
  NSF - National Science Digital Library persistent archive                          3,571       26,908,350              122
TOTAL                                                                               328 TB       51 million            4,900

Generic Infrastructure
• SDSC developed the Storage Resource Broker
(SRB) to support access to distributed data
• Effort started in 1996 as a DARPA funded project
• Now supports over 30 national/international projects
• Development team of 12 staff is led by
  • Michael Wan, data management systems
  • Arcot Rajasekar, information management systems

SDSC SRB Team (left to right)
• Arun Jagatheesan
• George Kremenek
• Sheau-Yen Chen
• Arcot Rajasekar (SRB development lead)
• Reagan Moore (SRB PI)
• Michael Wan (SRB architect)
• Roman Olschanowsky (BIRN)
• Bing Zhu
• Charlie Cowart
• Lucas Gilbert
• Tim Warnock
• Wayne Schroeder (SRB product)
• Adam Birnbaum (SRB production)
• Antoine De Torcy
• Vicky Rowley (BIRN)
• Marcio Faerman (SCEC)
• Students & emeritus
  • Erik Vandekieft
  • Reena Mathew
  • Xi (Cynthia) Sheng
  • Allen Ding
  • Grace Lin
  • Qiao Xin
  • Daniel Moore
  • Ethan Chen
  • Jon Weinburg
Supported by over 20 projects (NSF, DOE, NASA, NARA, NIH, LOC, NHPRC)

Data Grid Capabilities
• Data manipulation
• Containers
• Parallel I/O
• Firewall interactions
• Resource interactions
• Fault tolerance
• Load leveling
• Replication
• HIPAA security requirements
  • Authentication of all users
  • Access controls on data and metadata
  • Audit trails
  • Data encryption
  • Centralized control
• Application interfaces
  • C library, Shell commands, Java, Perl, Python, WSDL, workflow

Data Management System Features
• Data grid for managing distributed data
  • Latency management for bulk analyses of collections
  • Infrastructure independent
name spaces for describing data, resources, users, and state information
• Digital library for managing data context
  • Curation services for managing collections
  • Descriptive metadata for discovery
• Persistent archive to manage technology evolution
  • Interoperability mechanisms between heterogeneous storage systems and user access mechanisms

BIRN - Biomedical Informatics Research Network Data Grid
[Diagram labels: Wash U.; Duke; NIH/NCRR Centers for Imaging and Computing; Cal Tech; NPACI/SDSC; UCLA; Harvard; Cal-(IT)2; “Deep Web”; “Surface Web”; Wireless “Pad” Web Interface]
Integrating Cyber Infrastructure to Link:
• Advanced Imaging Instruments
• Data Intensive Computing
• Multi-Scale Brain Databases

Digital Library
• Collection hierarchy for organizing data
• User-defined metadata
• Collection level metadata
• Metadata manipulation
  • Schema extension
  • Bulk metadata processing
  • Queries on metadata
  • Access controls on metadata
  • Views on collections
• Digital library APIs
  • DSpace, Fedora, OAI-PMH, web browsers
  • METS metadata XML schema

Southern California Earthquake Center
[Diagram labels: Select Receiver (Lat/Lon); Select Scenario; Fault Model; Source Model; Output Time History Seismograms; SCEC Community Library]
Store seismic data
• Managing over 90 TBs, over 1.7 million files
• Store community models for seismic velocity
• Data distributed between USC, SDSC
SCEC community digital library
• Storage Resource Broker data grid technology
• NMI portal interface
• Digital library services to display seismograms
• Visualizations of seismic waves at the surface
• Visualization of seismic wave propagation through the volume

National Virtual Observatory
Provide access to large star catalogs and large image sky surveys
Virtual Observatory Architecture (diagram labels follow)
Discover; Compute; Publish; Collaborate
Portals, User Interfaces, Tools
VOPlot; Topcat; SkyQuery; DIS; Aladin
Registry Layer; Data Services; Compute Services
HTTP Services; SOAP Services; Grid Services
self-describing persistent, crossmatch visualization
ADS Digital Library
Other registries
XML, DC, METS
Existing Data Centers
OpenSkyQuery
OAI
image source detection data mining
Bulk Access
Semantics (UCD)
SIAP, SSAP
2MASS; SDSS; DPOSS; USNO-B; Macho
conVOT
interfaces to data
stateless, registered authenticated
OASIS; Mirage
Virtual Data; Workflow (pipelines); Authentication & Authorization; My Space storage services
Grid Middleware: SRB, Globus, OGSA, SOAP, GridFTP
Databases, Persistency, Replication
Disks, Tapes, CPUs, Fiber

National Science Digital Library
Preserve educational material that has been registered into a central repository at Cornell through URLs
• Crawl web and retrieve material, 10 levels of indirection
• Convert internal URLs into data grid handles
• Aggregate files into containers for storage
• Preserve using SRB data grid technology
• Currently housing over 26 million files
Web Interface to Persistent Archive

National Archives and Records Administration Research Prototype Persistent Archive
Demonstrate preservation environment
• Authenticity
• Integrity
• Management of technology evolution
• Mitigation of risk of data loss
  • Replication of data
  • Federation of catalogs
• Management of preservation metadata
• Scalability
• EAP collection
  • 350,000 files
  • 1.2 TBs in size

Federation of Three Independent Data Grids
• NARA MCAT: Principal copy stored at NARA with complete metadata catalog
• U Md MCAT: Replicated copy at U Md for improved access, load balancing and disaster recovery
• SDSC MCAT: Deep Archive at SDSC, no user access, but complete copy

For More Information
Reagan W. Moore
San Diego Supercomputer Center
[email protected]
http://www.npaci.edu/DICE
http://www.npaci.edu/DICE/SRB
http://www.npaci.edu/dice/srb/mySRB/mySRB.html
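
Illustrative sketch (not from the original slides)
The level of indirection described in the slides above, where logical file names (URIDs), metadata, access controls, checksums, and replica locations are kept in a catalog independently of the storage systems, can be mimicked in a few lines of Python. This is not the SRB API or the MCAT schema. The Catalog class, its method names, the resource names, and the example paths are all hypothetical and chosen only for illustration. Delegation of access-control rights, which the slides mention, is left out to keep the sketch short.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Replica:
    resource: str  # logical resource name (hypothetical, e.g. "archive-tape")
    path: str      # physical path inside that storage system


@dataclass
class Entry:
    owner: str                                    # logical user name
    checksum: str                                 # SHA-256 of the content
    metadata: dict = field(default_factory=dict)  # context, kept apart from content
    replicas: list = field(default_factory=list)  # physical copies of the file
    readers: set = field(default_factory=set)     # logical users allowed to read


class Catalog:
    """Toy in-memory stand-in for a metadata catalog (assumption, not MCAT)."""

    def __init__(self):
        self._entries = {}

    def register(self, urid, owner, content, resource, path, **metadata):
        # Register an existing file under a logical name and record its checksum.
        checksum = hashlib.sha256(content).hexdigest()
        self._entries[urid] = Entry(owner, checksum, dict(metadata),
                                    [Replica(resource, path)], {owner})

    def add_replica(self, urid, resource, path):
        # A replica shares the logical name, metadata, and access controls.
        self._entries[urid].replicas.append(Replica(resource, path))

    def grant_read(self, urid, owner, user):
        entry = self._entries[urid]
        if owner != entry.owner:
            raise PermissionError("only the owner may change access controls")
        entry.readers.add(user)

    def resolve(self, urid, user):
        # Logical-to-physical mapping, checked against the catalog's access list.
        entry = self._entries[urid]
        if user not in entry.readers:
            raise PermissionError(f"{user} may not read {urid}")
        return list(entry.replicas)

    def verify(self, urid, content):
        # Integrity validation against the stored checksum.
        return hashlib.sha256(content).hexdigest() == self._entries[urid].checksum

    def find(self, **conditions):
        # Discovery by query on metadata attributes.
        return sorted(urid for urid, entry in self._entries.items()
                      if all(entry.metadata.get(k) == v
                             for k, v in conditions.items()))


catalog = Catalog()
data = b"field recording, 2004"
urid = "/delaman/audio/rec001.wav"  # hypothetical logical file name
catalog.register(urid, "alice", data,
                 resource="local-disk", path="/data/rec001.wav", language="xyz")
catalog.add_replica(urid, "archive-tape", "/vol7/000123")
catalog.grant_read(urid, "alice", "bob")

replicas = catalog.resolve(urid, "bob")
print([r.resource for r in replicas])
print(catalog.verify(urid, data))
print(catalog.find(language="xyz"))
```

Because clients in this sketch hold only the URID, moving or replicating the file changes the catalog entry and nothing else, which is the property the Data Migration and Data Exchange scenarios rely on.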