CSCI5570 Large Scale Data Processing Systems
NewSQL
James Cheng
CSE, CUHK
Slide Ack.: modified based on the slides from Joy Arulraj

The End of an Architectural Era (It's Time for a Complete Rewrite)
Michael Stonebraker, Samuel Madden, Daniel J. Abadi, Stavros Harizopoulos, Nabil Hachem, Pat Helland
VLDB 2007

2 Outline
• The current state of the world
• Why the current architecture is aged
• How to beat it by a factor of 50 in every market I can think of
• Implications for the research community

3 Motivation
• System R (1974)
  – Seminal database design from IBM
  – First implementation of SQL
• Hardware has changed a lot over 3 decades
  – Databases are still based on System R's design
  – Includes DB2, SQL Server, etc.

4 System R Architectural Features
• Disk-oriented storage and indexing structures
• Multi-threading to hide latency
• Locking-based concurrency control mechanisms
• Log-based recovery

5 Current DBMS Standard
• Store fields in one record contiguously on disk
• Use B-tree indexing
• Use small (e.g. 4K) disk blocks
• Align fields on byte or word boundaries
• Conventional (row-oriented) query optimizer and executor

6 OLTP Bottlenecks

7 OLTP: Where does the time go?

8 Terminology – "Row Store"
[Figure: Record 1, Record 2, Record 3, Record 4 stored contiguously]
E.g.
DB2, Oracle, Sybase, SQL Server, …

9 Row Stores
• Can insert and delete a record in one physical write
• Good for business data processing
• And that was what System R and Ingres were gunning for

10 Extensions to Row Stores Over the Years
• Architectural stuff (shared nothing, shared disk)
• Object-relational stuff (user-defined types and functions)
• XML stuff
• Warehouse stuff (materialized views, bitmap indexes)
• …

11 At This Point, RDBMS is Aged
• There are at least 4 (non-trivial) markets where a row store can be clobbered by a specialized architecture (CIDR 07 paper)
  – Warehouses (Vertica, SybaseIQ, KX, …)
  – Text (Google, Yahoo, …)
  – Scientific data (MatLab, ASAP, …)
  – Streaming data (StreamBase, Coral8, …)

12 At This Point, RDBMS is Aged
• Leaving RDBMS with only the OLTP market (i.e., business data processing)
• But they are no good at that either!!!!!!

13 New OLTP Proposal
• First part
  – Main memory
  – No multi-threading
  – Grid orientation
  – Transient undo and no redo
  – No knobs
• Second part
  – JDBC/ODBC overhead
  – Concurrency control
  – Undo
  – 2 phase commit

14 New OLTP Proposal
• First part
  – Main memory
  – No multi-threading
  – Grid orientation
  – Transient undo and no redo
  – No knobs
• Second part
  – JDBC/ODBC overhead
  – Concurrency control
  – Undo
  – 2 phase commit

15 Main Memory Deployment
• 1970's: disk
• Now: main memory
TPC-C is 100 Mbytes per warehouse; 1000 warehouses is a HUGE operation; i.e. 100 Gbytes; i.e. main memory

16 Main Memory Deployment
• 1970's: terminal operator
• Now: unknown client over the web
Cannot allow user stalls inside a transaction!!!!!
Hence, there are no user stalls or disk stalls!!!!!

17 No Multi-threading
• Heaviest TPC-C Xact reads/writes 400 records
  – Less than 1 msec!!
• Run all commands to completion; single-threaded
• Dramatically simplifies DBMS
  – No concurrent B-tree, no multi-threaded data structures
  – No pool of file handles, buffers, threads, … => no resource governor!
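The single-threaded execution model above can be sketched in a few lines. This is a minimal, hypothetical illustration (not the actual H-Store code): one "site" owns one in-memory partition and runs queued transactions strictly one at a time, to completion, so no latches or concurrent data structures are ever needed.

```python
from collections import deque

# Hypothetical in-memory "site": one data partition, one thread.
# Each transaction runs to completion before the next starts, so the
# store needs no latches, locks, or concurrent data structures.
class Site:
    def __init__(self):
        self.store = {}         # main-memory table: key -> row
        self.queue = deque()    # transactions waiting to run

    def submit(self, txn):
        self.queue.append(txn)

    def run(self):
        results = []
        while self.queue:
            txn = self.queue.popleft()  # one at a time, to completion
            results.append(txn(self.store))
        return results

# Example transactions: plain functions over the store, standing in for
# stored procedures whose parameters are all known up front.
def new_order(store):
    store["order:1"] = {"customer": 42, "qty": 3}
    return "committed"

def read_order(store):
    return store.get("order:1")

site = Site()
site.submit(new_order)
site.submit(read_order)
print(site.run())  # ['committed', {'customer': 42, 'qty': 3}]
```

Since the heaviest TPC-C transaction touches only a few hundred main-memory records, running it to completion blocks the queue for well under a millisecond, which is why the slides argue multi-threading buys nothing here.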
18 Grid Computing
• Obviously cheaper
• Obvious wave of the foreseeable future (replacing shared disk)
• Horizontally partition data
  – Shared-nothing query optimizer and executor
• Add/delete sites on the fly required
High-end OLTP has to "scale out", not "scale up"

19 High Availability
• 1970's: disaster recovery was "tape shipping"
• Now: 7 x 24 x 365 no matter what
• Some organizations today run a hot standby: a second machine sits idle waiting to take over if the first one fails => only half of the resources used

20 Peer-to-Peer HA
• Multiple machines in a peer-to-peer configuration
• OLTP load dispersed across multiple machines
• Inter-machine replication used for fault tolerance
• Redundancy (at the table level) in the grid
• Optimizer chooses which instance of a table to read, writes all instances (transactionally)

21 Logging
• Undo log for roll-back if a transaction fails, but can be deleted on transaction commit
• No redo log; simply recover data from an operational site

22 Recovery in a K-safe Environment
• Restore the dead site
• Query up sites for live data
• When up to speed, join the grid
• Stop if you lose K+1 sites
• No redo log!!!!
Vertica has shown this to be perfectly workable

23 No Knobs
• RDBMSs have a vast array of complex tuning knobs
• New design should be self-everything
  – self-healing
  – self-maintaining
  – self-tuning
  – self-…

24 Main Sources of Overhead in RDBMS
• Disk I/O (gone)
• Resource control (gone)
• Synchronization (gone)
• Latching with multi-threaded data structures (gone)
• Redo log (gone)
• Undo log (but in main memory and discarded on commit)
• JDBC/ODBC interface
• Dynamic locking for concurrency control
• 2 phase commit (for multi-site updates and copies)

25 New OLTP Proposal
• First part
  – Main memory
  – No multi-threading
  – Grid orientation
  – Transient undo and no redo
  – No knobs
• Second part
  – JDBC/ODBC overhead
  – Concurrency control
  – Undo
  – 2 phase commit

26 H-Store System Architecture
• A grid of computers
• Rows of tables are placed contiguously in main memory, with B-tree indexing
• Each H-Store site is single-threaded
• Multi-cores => multiple logical sites (one site for each core) per physical site
• Main memory on a physical site is partitioned among its logical sites

27 Stored Procedures
• OLTP has changed
  – 1970's: conversational transactions
  – Now: can ask for all of them in advance
• Applications use stored procedures => JDBC/ODBC overhead (gone)

28 Transaction Classes
• Classify transactions in OLTP applications into classes and use their properties
• Get all transaction classes in advance
  – Instances differ by run-time parameters
• Construct a physical database design (manually now; automatically in the future)
  – Table partitioning
  – Table-level replication
• Create a query plan for each class

29 Transaction Classes
• Example
  – Class: "Insert record in History where customer = $(customer-Id); more SQL statements;"
  – Runtime instance supplies $(customer-Id), etc.
• Each transaction class has certain properties
  – Optimize concurrency control protocols
  – And commit protocols

30 Transaction Classes
• Transaction classes
  – Constrained tree applications
  – Single-sited transactions
  – One-shot transactions
  – Two-phase transactions
  – Sterile transactions
• Prevalent in major commercial online retail applications
• H-Store makes use of their properties

31 Constrained Tree Application
Every transaction has equality predicates on the primary key of the root node
[Figure: schema tree rooted at Customer, with Order and Order Line children, hash-partitioned into Partition 1 and Partition 2]

32 Single-sited transactions
• All queries hit the same partition
• Every transaction runs to completion at a single site
• Constrained Tree Application
  – Root table can be horizontally hash-partitioned
  – Collocate corresponding shards of child tables
  – No communication between partitions

33 Single-sited transactions
• CTAs are common in OLTP
• Making non-CTAs single-sited
  – remove read-only tables in the schema from the application and check if it now becomes a CTA
  – if yes, replicate these tables at all sites
• One-shot transactions

34 One-shot transactions
• One-shot transactions
  – execute in parallel without requiring intermediate results to be communicated among sites
  – no inter-query dependencies
• A one-shot transaction can be decomposed into single-sited plans
• Transactions in many applications can often be made one-shot by
  – vertical partitioning of tables among sites
  – replicating read-only columns

35 Two-phase classes
• A transaction class is two-phase if
  – Phase 1: read-only operations (the transaction may be aborted based on the query results)
  – Phase 2: queries and updates that can't violate integrity
• A transaction class is strongly two-phase if
  – Two-phase, and
  – Phase 1 operations on all sites produce the same result wrt aborting or continuing

36 Sterile classes
• Two concurrent transactions commute when any interleaving of their single-site sub-plans produces the
same final database state as any other interleaving (if both commit)
• A transaction class is sterile if
  – Its transactions commute with transactions of all other classes (including itself)

37 Query Execution
• Cost-based query optimizer producing query plans based on transaction classes at transaction definition time
  – Single-sited: dispatch to the appropriate site
  – One-shot: decompose into a set of plans, each executed at a single site
• Standard run-time optimizer for general transactions, as sites may communicate

38 Transaction Management
• Every transaction receives a timestamp (site_id, local_unique_timestamp)
• Given an ordering of sites, timestamps are unique and form a total order

39 Transaction Management
• Single-sited or one-shot
  – each transaction is dispatched to replica sites and executed to completion
  – unless sterile, each execution site waits a small period of time (to account for network delays) for transactions to arrive
    => execution in timestamp order
    => all replicas updated in the same order
    => identical outcome at each replica, all commit or all abort
    => no data inconsistency!
  – no redo log, no concurrency control, no distributed commit processing!

40 Transaction Management
• Two-phase
  – no integrity violation
  – no undo log
  – if also single-sited/one-shot, no transaction facilities at all!
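The timestamp-ordering idea above can be sketched concretely. This is an illustrative toy, not the actual H-Store protocol: each update carries a (local_unique_timestamp, site_id) pair, with site_id as the tie-breaker that makes the order total, and every replica sorts its inbox before applying updates, so all replicas reach an identical state no matter the arrival order.

```python
# Toy sketch of timestamp-ordered replica application (names hypothetical).
# Tuple comparison on (local_ts, site_id) gives a total order over
# transactions from all sites.

def apply_in_timestamp_order(inbox):
    """inbox: list of (timestamp, key, value) updates received by a replica."""
    state = {}
    for ts, key, value in sorted(inbox):  # total order: (local_ts, site_id)
        state[key] = value
    return state

updates = [
    ((2, 0), "balance:42", 90),   # local_ts=2 at site 0
    ((1, 1), "balance:42", 100),  # local_ts=1 at site 1
    ((1, 0), "balance:7", 55),    # local_ts=1 at site 0
]

# Two replicas receive the same updates in different network orders...
replica_a = apply_in_timestamp_order(updates)
replica_b = apply_in_timestamp_order(list(reversed(updates)))
assert replica_a == replica_b      # ...but end in the same state
print(replica_a["balance:42"])     # 90: ts (2, 0) is last in the total order
```

The "small period of time" each site waits in the slides plays the role of the sort here: it lets in-flight transactions with smaller timestamps arrive before execution begins, so applying them in timestamp order is possible without ever buffering the whole stream.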
41 Transaction Management
• Sterile
  – no concurrency control needed
  – no need for an execution order of transactions
  – no guarantee that all sites abort/continue
    • workers respond "abort" or "continue"
    • the execution supervisor communicates the info to worker sites
    • standard distributed commit processing needed unless the transaction is strongly two-phase

42 Transaction Management
• Non-sterile, non-single-sited, non-one-shot
  – do not use dynamic locking (expensive for short-lived transactions) for transaction consistency; instead:
  – first run with the basic strategy
  – if too many aborts, run the intermediate strategy
  – if still too many aborts, escalate to the advanced strategy

43 Transaction Management
• Basic Strategy
  – timestamp ordering of subplan pieces
  – wait for a "small period of time" to preserve timestamp order
  – execute the subplan
  – if no abort, continue with the next subplan
  – if no more subplans, commit

44 Transaction Management
• Intermediate Strategy
  – increase wait time to sequence the subplans, thereby lowering abort probability
• Advanced Strategy
  – track the read set and write set of each transaction at each site
  – worker site runs each subplan, and aborts (if necessary) by standard optimistic concurrency control rules

45 Results
• H-Store
  – Targets OLTP workloads
  – Shared-nothing main memory database
• TPC-C benchmark
  – All classes made two-phase → No coordination
  – Replication + vertical partitioning → One-shot
  – All classes still sterile in this schema → No waits

46 Results
[Figure: TPC-C throughput comparison. RDBMS: a very popular commercial RDBMS; Best TPC-C: best record on the TPC website]

47 TPC-C Performance on a Low-end Machine
• A very popular commercial RDBMS (or the elephants)
  – 850 transactions/sec
• H-Store
  – 70,416 transactions/sec
A factor of 82!!!!!
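The advanced strategy's validation step can be sketched as a standard optimistic-concurrency-control check over read and write sets. The function and field names below are illustrative, not the actual H-Store API: a transaction must abort if anything it read was written by a transaction that committed while it was running.

```python
# Sketch of standard OCC validation over read/write sets, as used by the
# "advanced strategy" described in the slides. Names are hypothetical.

def occ_conflicts(txn, committed_since_start):
    """txn and each committed transaction are dicts with 'reads'/'writes'
    key sets. Return True if txn must abort: something it read was
    overwritten by a transaction that committed after txn started."""
    for other in committed_since_start:
        if txn["reads"] & other["writes"]:  # read-write overlap => conflict
            return True
    return False

t1 = {"reads": {"stock:5"}, "writes": {"order:9"}}
t2 = {"reads": set(), "writes": {"stock:5"}}  # committed while t1 ran

print(occ_conflicts(t1, [t2]))  # True: t1 read stock:5, which t2 wrote
print(occ_conflicts(t1, []))    # False: no concurrent committers
```

This matches the escalation logic above: validation costs a set intersection per concurrent committer at commit time, which is cheap for the short, rarely conflicting transactions OLTP produces, whereas dynamic locking would pay lock-manager overhead on every record access.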
48 Implications for the Elephants
• They are selling "one size fits all"
• Which is 30-year-old legacy technology that is good at nothing
• The elephants: a collection of OLTP systems, connected to ETL, and connected to one or more data warehouses
  – ETL: extract-transform-load, used to convert OLTP data to a common format and load it into a data warehouse (for BI queries)

49 The DBMS Landscape – Performance Needs
[Figure: 2-D landscape from simple-operation, write-focused OLTP to complex-operation, read-focused Data Warehouse, with "Other apps" in between]

50 One Size Does Not Fit All
[Figure: the same landscape with specialized systems placed on it – H-Store for OLTP, Vertica/C-Store and open source for data warehouses, BigTable etc. for other apps in between; the elephants get only "the crevices"]

51 Summary
• "One size fits all" databases excel at nothing
• H-Store
  – Clean design for the OLTP domain from scratch