
Comdb2
BLOOMBERG’S HIGHLY AVAILABLE RELATIONAL DATABASE SYSTEM
— ALEX SCOTTI, MARK HANNUM, MICHAEL PONOMARENKO, DORIN HOGEA, AKSHAT SIKARWAR, MOHIT KHULLAR, ADI ZAIMI, JAMES LEDDY,
RIVERS ZHANG, FABIO ANGIUS, LINGZHI DENG
Presenter: Tianyuan Zhang
Motivation
SCENARIO:
A trader on NASDAQ is browsing stock prices on a Bloomberg terminal during high-traffic hours. He sees a profitable stock and wants to buy in immediately. All other traders should be able to see the same price at that moment and act accordingly.
PRODUCTION ENVIRONMENT & DEMANDS:
◦ The service should be reachable at all times.
◦ Any two applications should display the same information.
◦ Relatively fast response times.
◦ Huge amounts of data; scaling up a single machine is costly and impractical.
Requirements: scalability, high availability, full transactional support.
Scalability
◦ Replication over multiple machines.
◦ Trade-off between latency and consistency.
◦ Eventual consistency vs. strict consistency.
◦ Design choices: OCC vs. MVCC vs. 2PL.
[Figure: one master replicating to several replicants.]
Design choice: concurrency control
◦ OCC — Optimistic concurrency control (sketched below):
  ◦ Assumption: low data contention.
  ◦ Transactions use data resources without acquiring locks.
  ◦ Conflicts are checked before committing.
  ◦ Fastest of the three while still allowing concurrency.
◦ 2PL — Two-phase locking:
  ◦ Works under any workload.
  ◦ Restricts concurrency: conflicting transactions block on locks.
  ◦ Guarantees serializability.
  ◦ Can seriously degrade performance.
◦ MVCC — Multiversion concurrency control:
  ◦ Allows concurrent readers and writers.
  ◦ Clean semantics.
  ◦ Less efficient in memory and disk space (keeps multiple row versions).
  ◦ Adds code complexity.
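A minimal sketch of the OCC pattern described above (illustrative only, not Comdb2's implementation): transactions read without taking locks, remember the versions they saw, and validate those versions at commit time.

```python
class Store:
    def __init__(self):
        self.data = {}      # key -> value
        self.version = {}   # key -> integer version

    def begin(self):
        return {"reads": {}, "writes": {}}

    def read(self, txn, key):
        txn["reads"][key] = self.version.get(key, 0)   # remember the version seen
        return txn["writes"].get(key, self.data.get(key))

    def write(self, txn, key, value):
        txn["writes"][key] = value                      # buffered locally, no locks

    def commit(self, txn):
        # Validation phase: abort if any row we read changed since we read it.
        for key, seen in txn["reads"].items():
            if self.version.get(key, 0) != seen:
                return False                            # conflict: caller retries
        # Write phase: install buffered writes and bump versions.
        for key, value in txn["writes"].items():
            self.data[key] = value
            self.version[key] = self.version.get(key, 0) + 1
        return True

store = Store()
t = store.begin()
store.write(t, "price", 101)
assert store.commit(t)   # succeeds: nothing it read was changed concurrently
```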
Availability
◦ Synchronously replicate across data-centers.
◦ Clients use the API to discover services and reconnect when a server fails (sketched below).
◦ Tolerant of any type of outage or maintenance.
◦ Elastic deployment: free to change cluster structure.
◦ Instantaneous schema changes: update schema without rebuild.
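A rough sketch of that client-side discover-and-reconnect behavior. The helpers discover_nodes() and connect() are hypothetical stand-ins; the real Comdb2 client API performs discovery and failover internally.

```python
import random
import time

def discover_nodes(cluster):
    """Hypothetical service discovery; the real client API resolves nodes itself."""
    return ["node-a", "node-b", "node-c"]

def connect(node):
    """Hypothetical connect; raises ConnectionError if the node is down."""
    return node

def run_with_failover(cluster, statement, retries=5):
    # On any failure, rediscover the cluster and retry against another node.
    for attempt in range(retries):
        try:
            node = random.choice(discover_nodes(cluster))
            conn = connect(node)
            return f"ran {statement!r} on {conn}"
        except ConnectionError:
            time.sleep(0.1 * (attempt + 1))   # back off, then try elsewhere
    raise RuntimeError("no reachable node")

print(run_with_failover("mydb", "SELECT 1"))
```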
HASQL failure handling
[Figure: HASQL failover. At "begin transaction" the client API is handed a point-in-time token (LSN 100) and streams the transaction's rows from replicant S1, which is at LSN 100. When S1 fails, the API reconnects to replicant S2 (at LSN 101) and resumes the transaction against a snapshot at LSN 100; the master is also at LSN 100.]
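A small sketch of the idea in the figure: the client keeps the point-in-time token it received at BEGIN and, after a failure, resumes only on a replicant whose log covers that LSN. Class and function names are illustrative, not Comdb2's API.

```python
class Replicant:
    def __init__(self, name, lsn):
        self.name, self.lsn = name, lsn

def resume_snapshot(replicants, token_lsn):
    # Resume the transaction on any node able to serve a snapshot
    # at the point-in-time token handed out at "begin transaction".
    for node in replicants:
        if node.lsn >= token_lsn:
            return node
    return None   # no node can serve the snapshot yet; retry later

cluster = [Replicant("S1", 100), Replicant("S2", 101)]
token = 100                                # received at BEGIN
node = resume_snapshot(cluster, token)
print(f"resuming snapshot {token} on {node.name}")
```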
Full Transactional Support
◦ Atomicity: provided by the write-ahead log (WAL) protocol.
◦ Consistency: Synchronous replication of data, OCC.
◦ Isolation: Block, Read Committed, Snapshot Isolation, Serializable
◦ Durability: WAL, network commit.
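A toy illustration of the durability path above: the commit's WAL record is shipped to the replicants, and the commit is reported durable only once enough of them acknowledge it (the "network commit"). Which nodes must acknowledge is governed by Comdb2's coherency model, discussed later; the simple quorum rule here is only for illustration.

```python
class Node:
    def __init__(self):
        self.wal = []

    def append(self, record):
        self.wal.append(record)     # would be fsync'd to disk in a real system
        return True                  # acknowledgement back to the master

def network_commit(record, replicants, needed_acks):
    # The commit is durable once `needed_acks` replicants hold the WAL record.
    acks = sum(1 for node in replicants if node.append(record))
    return acks >= needed_acks

cluster = [Node(), Node(), Node()]
if network_commit(("set", "price", 101), cluster, needed_acks=2):
    print("commit acknowledged as durable")
```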
Database Isolation level
[Figure: nested isolation levels. From weakest to strongest, with the worst anomaly each level still permits:]
◦ Read Uncommitted / Block Isolation: dirty reads
◦ Read Committed [default]: nonrepeatable reads
◦ Repeatable Read: phantom reads
◦ Snapshot Isolation: write skew (illustrated below)
◦ Serializable: none
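To make the write-skew row concrete, here is the anomaly snapshot isolation permits but serializable forbids: two transactions each read both rows from their own snapshot, each updates a different row, and a cross-row invariant is broken even though their write sets do not overlap.

```python
# Write skew under snapshot isolation (illustrative).
# Invariant the application wants to preserve: x + y >= 0.
db = {"x": 50, "y": 50}

snap1 = dict(db)   # T1's snapshot
snap2 = dict(db)   # T2's snapshot

# T1: its snapshot shows x + y = 100, so withdrawing 100 from x looks safe.
if snap1["x"] + snap1["y"] >= 100:
    db["x"] = snap1["x"] - 100

# T2: its snapshot also shows x + y = 100, so it withdraws 100 from y.
if snap2["x"] + snap2["y"] >= 100:
    db["y"] = snap2["y"] - 100

# Neither transaction wrote a row the other wrote, so snapshot isolation
# commits both, yet the invariant is now violated; serializable would abort one.
print(db, "sum =", db["x"] + db["y"])   # sum = -100
```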
Life cycle
[Figure: transaction life cycle. Green marks the OCC phase (SQL executed without locks on a replicant); red marks the 2PL phase (the resulting write set applied on the master under two-phase locking).]
IMPLEMENTATION
Storage Layer
Replication
Cdb2 Layer
Storage Layer
◦ Uses B-trees to store every type of data.
◦ Multiple B-trees form a table.
◦ Improvements on BerkeleyDB:
1. Row locks — finer granularity compared to page locks.
2. Prefaulting (readahead), sketched below:
◦ B-tree readahead
◦ Local prefaulting
◦ Remote prefaulting
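A simplified sketch of the prefaulting idea listed above: when one B-tree page is requested, nearby pages are queued for background reads so the foreground thread does not stall on I/O later. This is the general pattern only, not BerkeleyDB or Comdb2 code.

```python
from concurrent.futures import ThreadPoolExecutor

def read_page(page_no):
    # Stand-in for reading one B-tree page from disk.
    return f"page-{page_no}"

def get_with_readahead(page_no, pool, window=4):
    # Queue the next few pages in the background ("prefault"),
    # then return the requested page to the caller.
    for ahead in range(page_no + 1, page_no + 1 + window):
        pool.submit(read_page, ahead)
    return read_page(page_no)

with ThreadPoolExecutor(max_workers=2) as pool:
    print(get_with_readahead(10, pool))
```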
Storage Layer(cont.)
3. Root caching — allows concurrent reads.
4. Compression — trades CPU cycles for less disk I/O.
5. Concurrent I/O — multiple threads flush pages in parallel.
Replication logic
◦ Use of LSNs (Log Sequence Numbers).
◦ Performance concerns of a synchronous system.
◦ Durability in Comdb2:
  ◦ Network commit
  ◦ Early ack (sketched below)
◦ Preserving concurrency when sending the log to the replicants.
◦ Coherency model:
  ◦ Keeps replicants up to date.
  ◦ Short-term leases.
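A sketch of the LSN-ordered log shipping and the "early ack" idea: the replicant acknowledges as soon as the record is safely in its local log and applies it afterwards, in LSN order. Names and structure are illustrative, not Comdb2's internals.

```python
class Replicant:
    def __init__(self):
        self.log = []            # (lsn, record) pairs, shipped by the master
        self.applied_lsn = 0

    def receive(self, lsn, record):
        self.log.append((lsn, record))   # durable append to the local log
        return lsn                        # "early ack": ack before applying

    def apply_pending(self):
        # Apply strictly in LSN order; in the real system independent records
        # can be applied by several threads to preserve the master's concurrency.
        for lsn, record in sorted(self.log):
            if lsn > self.applied_lsn:
                self.applied_lsn = lsn    # stand-in for actually applying `record`

r = Replicant()
ack = r.receive(101, ("update", "price", 101))
r.apply_pending()
print("acked", ack, "- applied through", r.applied_lsn)
```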
Eliminating dirty reads
[Figure: a cluster with a master and replicants S0-S4 serving clients C1 and C2.]
Eliminating dirty reads
[Figure: the same cluster after a new node S2 joins; it does not serve client reads until it becomes coherent.]
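A sketch of the lease-based coherency model behind these figures: a replicant may serve reads only while it holds an unexpired lease from the master; a node that falls behind is marked incoherent, its lease is not renewed, and clients are routed to coherent nodes until it catches up. The timing and data structures here are simplified assumptions, not Comdb2's implementation.

```python
import time

LEASE_SECONDS = 0.5

class Replicant:
    def __init__(self, name):
        self.name = name
        self.lease_expiry = 0.0

    def coherent(self):
        return time.time() < self.lease_expiry

class Master:
    def __init__(self, replicants):
        self.replicants = replicants
        self.incoherent = set()

    def renew_leases(self):
        for r in self.replicants:
            if r.name not in self.incoherent:
                r.lease_expiry = time.time() + LEASE_SECONDS

    def mark_incoherent(self, name):
        self.incoherent.add(name)    # lease lapses; node must catch up first

def route_read(replicants):
    # Clients are only routed to coherent nodes, so stale reads are impossible.
    return next(r for r in replicants if r.coherent())

nodes = [Replicant("S0"), Replicant("S2")]
master = Master(nodes)
master.mark_incoherent("S2")         # S2 fell behind (or just joined)
master.renew_leases()
print("read served by", route_read(nodes).name)   # -> S0
```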
Cdb2 Layer
Data organization:
◦ Genid (64-bit row identifier): counter (48 bits) | update-id (12 bits) | stripe-id (4 bits). Bit packing sketched below.
◦ Row header: length (28 bits) | update-id (12 bits) | schema version (8 bits) | flags (8 bits).
◦ Shadow trees: per-transaction side B-trees holding a transaction's own changes for snapshot-style isolation.
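A small sketch of packing and unpacking the genid fields listed above (48-bit counter, 12-bit update-id, 4-bit stripe-id). The exact field order inside Comdb2 may differ; this only shows that the three fields fit one 64-bit value.

```python
def pack_genid(counter, update_id, stripe_id):
    # 64-bit genid laid out here as: counter (48) | update-id (12) | stripe-id (4).
    assert counter < 1 << 48 and update_id < 1 << 12 and stripe_id < 1 << 4
    return (counter << 16) | (update_id << 4) | stripe_id

def unpack_genid(genid):
    return (genid >> 16, (genid >> 4) & 0xFFF, genid & 0xF)

g = pack_genid(counter=123456, update_id=7, stripe_id=3)
print(hex(g), unpack_genid(g))   # -> (123456, 7, 3)
```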
Cdb2 Layer (cont.)
◦ BPLog — translation of SQL into low-level execution steps.
◦ The master executes the bplog using 2PL and writes to the WAL.
◦ Resilient to master re-election.
◦ Schema changes:
  ◦ Declarative schema change.
  ◦ Phoenix transaction.
  ◦ Compatible change: lazy substitution.
  ◦ Incompatible change: rebuild a hidden version in the background (sketched below).
    ◦ Reads occur against the original table.
    ◦ Writes are performed on both the regular and the hidden version of the table.
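A sketch of the online-rebuild write path in the last two bullets: during an incompatible schema change, each write is applied to both the live table and the hidden new-format copy being rebuilt in the background, while reads keep going to the live table until cutover. Purely illustrative; names and the conversion step are made up.

```python
class OnlineRebuild:
    def __init__(self, live_rows):
        self.live = dict(live_rows)   # current-format table, still serving reads
        self.hidden = {}              # new-format copy, rebuilt in the background

    def convert(self, value):
        return {"v2": value}          # stand-in for re-encoding a row in the new schema

    def write(self, key, value):
        self.live[key] = value                    # writes hit the regular table...
        self.hidden[key] = self.convert(value)    # ...and the hidden version

    def read(self, key):
        return self.live.get(key)     # reads stay on the original table until cutover

    def background_step(self, key):
        # Background rebuild copies not-yet-converted rows into the hidden table.
        if key in self.live and key not in self.hidden:
            self.hidden[key] = self.convert(self.live[key])

t = OnlineRebuild({"a": 1, "b": 2})
t.write("c", 3)
t.background_step("a")
print(t.read("c"), t.hidden)
```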
TUNING
Trading off performance with consistency:
1. By size
2. By type
3. Tuning incoherence
4. Asynchronous logging
Evaluation
◦ 6-node cluster spread across 2 datacenters.
Limitations & future work
◦ Cannot write to tables that exist in remote databases.
◦ Future work: use 2PC across database replication groups.
◦ Write operations cannot scale linearly with more machines.
◦ They saturate a single machine's ability to process the low-level bplogs.
◦ Solution: multi-master systems or partitioned data sets.
Questions and discussion
What does the master do if it receives no acknowledgement from the replicants, and how long does it take to make progress?
OCC has its limitations if the write-to-read ratio is high.
Do replicas just work as stand-by nodes in the case of failure?
How does it compare to NoSQL systems? How does it perform in geo-distributed databases?