Comdb2: Bloomberg's Highly Available Relational Database System
Alex Scotti, Mark Hannum, Michael Ponomarenko, Dorin Hogea, Akshat Sikarwar, Mohit Khullar, Adi Zaimi, James Leddy, Rivers Zhang, Fabio Angius, Lingzhi Deng
Presenter: Tianyuan Zhang

Motivation
Scenario: a production environment and its demands. A trader on NASDAQ is browsing stock prices on a Bloomberg terminal during a high-traffic hour. He sees a profitable stock and wants to buy immediately. All other traders should see the same price at that moment and be able to act on it.
◦ The service should be reachable at all times.
◦ Any two applications should display the same information.
◦ Response times should be reasonably fast.
◦ The data volume is huge; scaling up a single machine is costly and impractical.
Requirements: scalability, high availability, full transactional support.

Scalability
◦ Replication over multiple machines.
◦ Trade-off between latency and consistency.
◦ Eventual consistency vs. strict consistency.
◦ Design choices: OCC vs. MVCC vs. 2PL.
(Diagram: a master node replicating to three replicants.)

Design choice: concurrency control
◦ OCC (optimistic concurrency control):
  ◦ Assumes low data contention.
  ◦ Transactions use data resources without acquiring locks.
  ◦ Conflicts are checked just before commit.
  ◦ The fastest option while still preserving concurrency.
◦ 2PL (two-phase locking):
  ◦ Works under any conditions.
  ◦ Disallows concurrency.
  ◦ Guarantees serializability.
  ◦ Seriously degrades performance.
◦ MVCC (multiversion concurrency control):
  ◦ Allows concurrency.
  ◦ Clean semantics.
  ◦ Less efficient in memory and disk space.
  ◦ Adds code complexity.

Availability
◦ Synchronous replication across datacenters.
◦ Clients use an API to discover services and to reconnect when a server fails.
◦ Tolerant of any type of outage or maintenance.
◦ Elastic deployment: free to change the cluster structure.
◦ Instantaneous schema changes: update the schema without a rebuild.

HASQL failure handling
(Diagram: the client begins a transaction and the API records point-in-time token 100 along with the buffered statements AAAAA, BBBBB, CCCCC, DDDD. When node S1 (LSN 100) fails, the API reconnects to node S2 (LSN 101) and resumes from a snapshot at LSN 100.)

Full Transactional Support
◦ Atomicity: provided by the write-ahead log (WAL) protocol.
◦ Consistency: synchronous replication of data, OCC.
◦ Isolation: Block, Read Committed, Snapshot Isolation, Serializable.
◦ Durability: WAL, network commit.

Database isolation levels (strongest to weakest) and the anomalies each admits:
◦ Serializable: none.
◦ Snapshot Isolation: write skew.
◦ Repeatable Read: phantom reads.
◦ Read Committed: non-repeatable reads.
◦ Block Isolation [default].
◦ Read Uncommitted: dirty reads.

Life cycle
(Diagram: the transaction life cycle; the green phase runs under OCC, the red phase under 2PL.)

Implementation
◦ Storage layer
◦ Replication
◦ Cdb2 layer

Storage layer
◦ Uses B-trees to store every type of data.
◦ Multiple B-trees form a table.
◦ Improvements over BerkeleyDB:
  1. Row locks: finer granularity than page locks.
  2. Prefaulting (readahead):
     ◦ B-tree readahead
     ◦ Local prefaulting
     ◦ Remote prefaulting

Storage layer (cont.)
  3. Root caching: allows concurrent reads.
  4. Compression: trades CPU cycles for less disk I/O.
  5. Concurrent I/O: multi-threaded page flushing.

Replication logic
◦ Use of LSNs (log sequence numbers).
◦ Performance concerns of a synchronous system.
◦ Durability in Comdb2:
  ◦ Network commit
  ◦ Early ack
◦ Preserving concurrency when shipping the log to replicants.
◦ Coherency model:
  ◦ Replicants must be kept up to date.
  ◦ Short-term leases.

Eliminating dirty reads
(Diagrams: first, the master with replicants S0–S4 serving clients C1 and C2; then, after a failover, a new master is elected among S0–S4 and C2 continues against it.)

Cdb2 layer
Data organization:
◦ Genid: counter (48 bits), update-id (12 bits), stripe-id (4 bits) — see the bit-packing sketch below.
◦ Row header: length (28 bits), update-id (12 bits), schema version (8 bits), flags (8 bits).
◦ Shadow trees.
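
The genid layout above (a 48-bit counter, a 12-bit update-id, and a 4-bit stripe-id packed into 64 bits) can be illustrated with a small bit-packing sketch. This is only an illustration under the bit widths stated on the slide; the helper names and field order are hypothetical, not Comdb2's actual code.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical sketch of a 64-bit genid laid out as
 * counter(48) | update-id(12) | stripe-id(4), matching the slide. */

static uint64_t genid_make(uint64_t counter, unsigned update_id, unsigned stripe)
{
    return ((counter & 0xFFFFFFFFFFFFULL) << 16)   /* 48-bit counter   */
         | ((uint64_t)(update_id & 0xFFF) << 4)    /* 12-bit update-id */
         | (stripe & 0xFULL);                      /*  4-bit stripe-id */
}

static uint64_t genid_counter(uint64_t g)   { return g >> 16; }
static unsigned genid_update_id(uint64_t g) { return (g >> 4) & 0xFFF; }
static unsigned genid_stripe(uint64_t g)    { return g & 0xF; }

int main(void)
{
    /* Bumping only the update-id leaves the counter and stripe untouched,
     * which is how a row can be updated while keeping the rest of its
     * identifier stable. */
    uint64_t g  = genid_make(123456789ULL, 0, 3);
    uint64_t g2 = genid_make(genid_counter(g), genid_update_id(g) + 1, genid_stripe(g));

    printf("counter=%llu update_id=%u stripe=%u\n",
           (unsigned long long)genid_counter(g2),
           genid_update_id(g2), genid_stripe(g2));
    return 0;
}
```
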
Cdb2 layer (cont.)
◦ BPLog: the translation of SQL into low-level execution operations.
◦ The master executes the bplog using 2PL and writes to the WAL (a toy sketch of this verify-and-apply step appears after the discussion questions).
◦ Resilient to master re-election.
◦ Schema changes:
  ◦ Declarative schema changes.
  ◦ Phoenix transaction.
  ◦ Compatible change: lazy substitution.
  ◦ Incompatible change: rebuild a hidden version of the table in the background.
    ◦ Reads occur against the original table.
    ◦ Writes are performed on both the regular and the hidden version of the table.

Tuning: trading off performance against consistency
◦ By size
◦ By type
◦ Tuning incoherence
◦ Asynchronous logging

Evaluation
◦ 6-node cluster spread across two datacenters.

Limitations and future work
◦ Cannot write to tables that exist in remote databases.
  ◦ Future work: 2PC across database replication groups.
◦ Write operations cannot be scaled linearly by adding machines.
  ◦ Throughput saturates on a single machine's ability to process the low-level bplogs.
  ◦ Possible solutions: multi-master systems or partitioned data sets.

Questions and discussion
◦ What does the master do if it receives no acknowledgement from the replicants, and how long does it take to make progress?
◦ OCC has its limitations when the write-to-read ratio is high.
◦ Do replicants just work as standby nodes for the case of failure?
◦ How does it compare to NoSQL systems?
◦ How does it perform in geo-distributed deployments?
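
To make the BPLog bullets concrete: the deck says replicants run SQL optimistically and ship low-level operations to the master, which verifies and applies them under 2PL before they reach the WAL. The toy C sketch below models only the optimistic verification idea, with each write carrying the genid the replicant observed; every structure and name here is hypothetical and is not Comdb2's actual bplog format.

```c
#include <stdint.h>
#include <stdio.h>

/* Toy model of a verify-and-apply step for a bplog transaction:
 * each write carries the genid the replicant read under OCC, and the
 * whole transaction aborts if any row's current genid has moved on. */

#define NROWS 4

struct row {
    uint64_t genid;      /* current version identifier of the row */
    int      value;
};

struct bplog_update {
    int      rowid;      /* which row the replicant targeted        */
    uint64_t read_genid; /* genid observed when the SQL was executed */
    int      new_value;
};

static int apply_bplog(struct row *table,
                       const struct bplog_update *ops, int nops)
{
    /* Verification pass: optimistic check before performing any write. */
    for (int i = 0; i < nops; i++) {
        if (table[ops[i].rowid].genid != ops[i].read_genid)
            return -1;   /* conflict: another transaction updated the row */
    }
    /* Write pass: apply values and bump genids so later checks see it. */
    for (int i = 0; i < nops; i++) {
        table[ops[i].rowid].value = ops[i].new_value;
        table[ops[i].rowid].genid++;
    }
    return 0;
}

int main(void)
{
    struct row table[NROWS] = { {100, 1}, {200, 2}, {300, 3}, {400, 4} };

    struct bplog_update txn1[] = { {0, 100, 10}, {1, 200, 20} };
    struct bplog_update txn2[] = { {1, 200, 99} };  /* read row 1 before txn1 wrote it */

    printf("txn1: %s\n", apply_bplog(table, txn1, 2) == 0 ? "committed" : "aborted");
    printf("txn2: %s\n", apply_bplog(table, txn2, 1) == 0 ? "committed" : "aborted");
    return 0;
}
```

In the deck's design the master runs this kind of check while holding row locks under 2PL, so a successful verification cannot be invalidated before the writes land in the WAL; the sketch omits locking entirely and only shows why a stale read (txn2) is rejected.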