Pursuit of a Scalable High Performance Multi-Petabyte Database
16th IEEE Symposium on Mass Storage Systems, 17-Mar-99
Andrew Hanushevsky, SLAC Computing Services
Marcin Nowak, CERN
Produced under contract DE-AC03-76SF00515 between Stanford University and the Department of Energy

High Energy Experiments
• BaBar at SLAC
  - High precision investigation of B-meson decays
  - Explore the asymmetry between matter and antimatter: where did all the antimatter go?
• ATLAS at CERN
  - Probe the Higgs boson energy range
  - Explore the more exotic reaches of physics

High Energy Physics: Quantitative Challenge

                        BaBar/SLAC          ATLAS/CERN
  Starts                May 1999            May 2005
  Data volume           0.2 petabytes/yr    5.0 petabytes/yr
  Total amount          2.0 petabytes       100 petabytes
  Aggregate xfr rate    200 MB/sec disk     100 GB/sec disk
                        60 MB/sec tape      1 GB/sec tape
  Processing power      5,000 SPECint95     250,000 SPECint95
  (SPARC Ultra 10's)    526                 27,000
  Physicists            800                 3,000
  Locations             87                  250
  Countries             9                   50

Common Elements
• Data will be stored in an object-oriented database: Objectivity/DB
• Most data will be kept offline: HPSS
  - Heavy duty, industrial strength mass storage system
  - Has the theoretical ability to scale to the size of the experiments
• BaBar will be blazing the path
  - First large scale experiment to use this combination
  - The year of the hare will be a very interesting time

Objectivity/DB Client/Server Application
• Primary access is through the Advanced Multithreaded Server (AMS)
  - AMS serves "pages" (512-byte to 64K-byte blocks)
  - Can have any number of AMS's
  - Similar to other remote filesystem interfaces (e.g., NFS)
• The Objectivity client can read and write database "pages" via the AMS
  - Pages range from 512 bytes to 64K in powers of 2 (e.g., 1K, 2K, 4K, etc.)
(diagram: clients reach database files either locally through the ufs protocol or remotely through the ams protocol)

High Performance Storage System
(diagram: HPSS components joined by a control network and a data network)
• Bitfile Server
• Name Server
• Storage Servers
• Physical Volume Library
• Physical Volume Repositories
• Storage System Manager
• Migration/Purge Server
• Metadata Manager
• Log Daemon
• Log Client
• Startup Daemon
• Encina/SFS
• DCE

The Obvious Solution
(diagram: a compute farm, database servers, a mass storage system, and external collaborators joined by a network switch)
• But... the devil is in the details

Capacity and Transfer Rate
(chart: tape cartridge capacity and disk system capacity in GB, and disk and tape transfer rates in MB/sec, plotted from 1988 through 2006)

The Capacity/Transfer Rate Gap
• Density is growing faster than the ability to transfer data
  - We can store the data just fine, but do we have the time to look at it? (a back-of-the-envelope sketch follows this slide)
• There are solutions short of poverty
  - Striped tape?
  - Intelligent staging
  - Primary access on RAID devices
    - Cost/performance is still a problem
    - Need to address the UFS scaling problem
  - Replication - a fatter pipe?
    - Only if you want a lot of headaches
    - Data synchronization problem
    - Load balancing issues
• Whatever the solution is, you'll need a lot of them
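The gap question above ("do we have the time to look at it?") can be made concrete with the table's own figures. The C++ sketch below is purely illustrative: it computes how long a single full pass over BaBar's 2.0 PB would take at the quoted aggregate disk and tape rates. Only those three figures come from the slides; the 1 MB = 10^6 bytes convention and the program itself are assumptions.

```cpp
// Back-of-the-envelope sketch: time for one full sweep of the BaBar dataset
// at the aggregate rates from the "Quantitative Challenge" table above.
// The 2.0 PB size and the 200 MB/s (disk) and 60 MB/s (tape) figures come
// from the slides; the rest is just arithmetic for illustration.
#include <cstdio>

int main() {
    const double totalBytes = 2.0e15;                        // 2.0 petabytes
    const struct { const char *medium; double mbPerSec; } rates[] = {
        {"disk (200 MB/s aggregate)", 200.0},
        {"tape  (60 MB/s aggregate)",  60.0},
    };
    for (const auto &r : rates) {
        double seconds = totalBytes / (r.mbPerSec * 1.0e6);  // MB/s -> bytes/s
        std::printf("%-28s %10.0f seconds = %6.1f days\n",
                    r.medium, seconds, seconds / 86400.0);
    }
    return 0;   // roughly 116 days on disk, 386 days on tape
}
```

Even at the full aggregate disk rate, one sweep takes months, which is why the rest of the talk concentrates on staging, replication, and load balancing rather than raw capacity.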
Part of the solution: Together Alone
• HPSS
  - Highly scalable, excellent I/O performance for large files, but
  - High latency for small block transfers (i.e., Objectivity/DB)
• AMS
  - Efficient database protocol and highly flexible, but
  - Limited security, tied to the local filesystem
• Need to synergistically mate these systems

Opening up new vistas: The Extensible AMS
(diagram: the AMS layered on the oofs interface, which in turn sits on a system specific interface)

As big as it gets: Scaling The File System
• Veritas Volume Manager
  - Catenates disk devices to form very large capacity logical devices
• Veritas File System
  - High performance (60+ MB/sec) journaled file system for fast recovery
• The combination is used as the HPSS staging target
  - Allows for fast streaming I/O and efficient small block transfers

Not out of the woods yet: Other Issues
• Access patterns: random vs. sequential
• Staging latency
• Scalability
• Security

No prophets here: Supplying Performance Hints
• Need additional information for optimum performance
  - Different from Objectivity clustering hints
  - Information is Objectivity independent: database clustering, processing mode (sequential/random), desired service levels
• Need a mechanism to tunnel opaque information
  - The client supplies hints via the oofs_set_info() call
  - The information is relayed to the AMS in a transparent way
  - The AMS relays the information to the underlying file system via oofs()

Where's the data? Dealing With Latency
• Hierarchical filesystems may have high latency bursts (e.g., mounting a tape file)
• Need a mechanism to notify the client of the expected delay
  - Prevents request timeouts
  - Prevents retransmission storms
• Also allows the server to degrade gracefully: it can delay clients when overloaded
• Defer Request Protocol
  - Certain oofs() requests, for example open(), can tell the client of the expected delay
  - The client waits the indicated amount of time and tries again

Many out of one: Dynamically Replicated Databases
• Dynamically distributed databases
  - A single machine can't manage over a terabyte of disk cache
  - No good way to statically partition the database
• Dynamically varying database access paths
  - As load increases, add more copies; copies are accessed in parallel
  - As load decreases, remove copies to free up disk space
• Objectivity catalog independence
  - Copies are managed outside of Objectivity
  - Minimizes the impact on administration

If there are many, which one do I go to?
• Request Redirect Protocol
  - oofs() routines supply an alternate AMS location
  - oofs routines are responsible for update synchronization
    - Typically, read-only access is provided on copies
    - Only one read/write copy is conveniently supported
    - The client must declare its intention to update prior to access
    - Lazy synchronization is possible
    - A good mechanism for largely read-only databases
• Load balancing is provided by an AMS collective
  - Has one distinguished member recorded in the catalogue
(a client-side sketch of the defer and redirect exchanges follows the next slide)

The AMS Collective
(diagram: collective members are effectively interchangeable; distinguished members redirect clients to other members)
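The Defer Request and Request Redirect protocols described in the last few slides are easiest to see from the client's side. The sketch below is a hypothetical C++ illustration, not the actual Objectivity/AMS or oofs API: OpenReply, sendOpen(), and the host/path names are invented stand-ins, and only the defer-then-retry and redirect-to-another-collective-member behavior comes from the slides.

```cpp
// Minimal client-side sketch of the Defer Request and Request Redirect
// exchanges. Types and the sendOpen() transport call are hypothetical.
#include <chrono>
#include <string>
#include <thread>

enum class Status { Ok, Defer, Redirect, Error };

struct OpenReply {
    Status      status;
    int         waitSeconds;   // meaningful when status == Defer
    std::string altHost;       // meaningful when status == Redirect
    int         handle;        // meaningful when status == Ok
};

// Stub transport call; a real client would send the open request to the AMS
// at 'host' and decode its reply. Here it simply reports success.
static OpenReply sendOpen(const std::string & /*host*/, const std::string & /*path*/) {
    return {Status::Ok, 0, "", 3};
}

// Open a database file, honoring defers (e.g., a tape mount in progress)
// and redirects (another member of the AMS collective should serve us).
int openWithRetry(std::string host, const std::string &path) {
    for (;;) {
        OpenReply r = sendOpen(host, path);
        switch (r.status) {
        case Status::Ok:
            return r.handle;                 // page I/O can now proceed
        case Status::Defer:                  // server says: try again later
            std::this_thread::sleep_for(std::chrono::seconds(r.waitSeconds));
            break;                           // no timeout, no retransmission storm
        case Status::Redirect:               // server names a less-loaded copy
            host = r.altHost;
            break;
        default:
            return -1;                       // hard error
        }
    }
}

int main() { return openWithRetry("ams01.example.org", "/db/pages.DB") >= 0 ? 0 : 1; }
```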
Keeping the hackers at bay: Object Oriented Security
• No amount of performance is sufficient if you always have to recompute
• Need a mechanism to provide security to thwart hackers
• Protocol Independent Authentication Model
  - Public or private key: PGP, RSA, Kerberos, etc.
  - Can be negotiated at run-time
  - Automatically called by the client and server kernels
• The client Objectivity kernel creates security objects as needed
  - Supplied via replaceable shared libraries
  - Security objects supply context-sensitive authentication credentials
• Works only with the Extensible AMS via the oofs interface

Overall Effects: Extensible AMS
• Allows use of any type of filesystem via the oofs layer
• Opaque Information Protocol: allows passing of hints to improve filesystem performance
• Generic Authentication Protocol: allows proper client identification
• Defer Request Protocol: accommodates hierarchical filesystems
• Redirection Protocol: accommodates terabyte+ filesystems and provides for dynamic load balancing

Dynamic Load Balancing
(diagram: dynamic selection among hierarchical, secure AMS servers)

Summary
• AMS is capable of high performance
  - Ultimate performance is limited by disk speeds
  - Should be able to deliver an average of 20 MB/sec per disk
• The oofs interface plus the other protocols greatly enhance performance, scalability, usability, and security
• 5+ TB of SLAC data has been processed using AMS+HPSS
  - Some AMS problems; no HPSS problems
• SLAC will be using this combination to store physics data
  - The BaBar experiment will produce a database of over 2 PB in 10 years
  - 2,000,000,000,000,000 = 2x10^15 bytes, or about 200,000 3590 tapes

Now for the reality
• Full AMS features are not yet implemented
  - The SLAC/Objectivity design has been completed: the oofs OO interface, OO security, and the protocols (i.e., DRP, RRP, and GAP)
  - The oofs and ooss layers are completely functional
  - HPSS integration is full-featured and complete
  - Initial feature set, to be deployed late summer: DRP, GAP, and limited RRP
  - Protocol development has been fully funded at SLAC; full asynchronous replication within 2 years
• The CERN and SLAC approaches are similar, but quite different in detail....

CERN staging approach: RFIO/RFCP + HPSS
(diagram: the AMS handles file & catalog management and serves DB pages through UNIX FS I/O from a disk pool on a Solaris disk server running an RFIO daemon and a migration daemon; RFIO calls and stage-in requests go to the HPSS server, and RFCP (RFIO copy) moves data between the disk pool and the HPSS mover and tape robot)

SLAC staging approach: PFTP + HPSS
(diagram: the AMS handles file & catalog management and serves DB pages from a disk pool on a Solaris disk server; a gateway daemon handles gateway requests and drives the PFTP control connection to the HPSS server, a migration daemon issues the stage-in requests, and PFTP data transfers run between the HPSS mover/tape robot and the disk pool; a minimal stage-in sketch follows the last slide)

SLAC ultimate approach: Direct Tape Access
(diagram: the AMS again handles file & catalog management and serves DB pages from the disk pool; the migration daemon issues stage-in requests to the HPSS server through the native API (RPC), and data moves by direct transfer between the HPSS mover/tape robot and the disk pool)

CERN 1TB Test Bed
(diagram, showing the current approximation and a future 1Gb switched Ethernet star topology: an IBM RS6000 HPSS server, HPSS data movers, an IBM tape silo, SUN Sparc 5 machines running the RFIO daemon and the AMS/HPSS interface, a DEC Alpha, and a staging pool, linked by FDDI, HIPPI, and Fast Ethernet)

SLAC Configuration (approximate)
(diagram: an IBM RS6000 F50 HPSS server connected by Gigabit Ethernet to several Sun 4500s, each running an AMS server and an HPSS mover, with 900 GB of disk shown)

SLAC Detailed Configuration
(detailed configuration diagram)
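As a rough illustration of the staging step common to the CERN and SLAC approaches, the sketch below shows the decision a server-side helper has to make before a database file can be opened: serve it from the disk pool if it is resident, otherwise fetch it from HPSS first (over RFCP or PFTP today, direct tape access later). Everything here, the function names, the paths, and the stubbed transfer, is hypothetical; the real gateway and migration daemons are not described in the slides.

```cpp
// Minimal sketch of the stage-in decision: if a requested database file is
// not resident in the staging pool, it must be fetched from HPSS before the
// AMS opens it. fetchFromHPSS() and the pool layout are placeholders.
#include <string>
#include <sys/stat.h>

static bool isResident(const std::string &poolPath)
{
    struct stat sb;
    return ::stat(poolPath.c_str(), &sb) == 0;
}

// Placeholder for the actual transfer, which the slides show running over
// RFCP or a PFTP control/data pair between the disk pool and HPSS.
static bool fetchFromHPSS(const std::string & /*hpssPath*/,
                          const std::string & /*poolPath*/)
{
    return true;                              // stub: pretend the transfer succeeded
}

// Returns true once the file is available in the staging pool; while the
// fetch is in progress the AMS would answer the client with a Defer.
bool stageIn(const std::string &hpssPath, const std::string &poolPath)
{
    if (isResident(poolPath))
        return true;                          // already on disk: open at disk speed
    return fetchFromHPSS(hpssPath, poolPath); // tape mount latency happens here
}

int main()  // hypothetical paths, for illustration only
{
    return stageIn("/hpss/db/run1.DB", "/stage/db/run1.DB") ? 0 : 1;
}
```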