
CSCI5570 Large Scale Data
Processing Systems
NewSQL
James Cheng
CSE, CUHK
Slide Ack.: adapted from slides by Rui Zhang
Schism: a Workload-Driven Approach to
Database Replication and Partitioning
Carlo Curino, Evan Jones, Yang Zhang, Sam Madden
VLDB 2010
Motivation
• Why partitioning?
– Scalability: process partitions in parallel
– Availability: replicate partitions
– Manageability: rolling upgrades and configuration changes can be made one partition at a time
Motivation
• Distributed transactions are expensive
– Consensus protocol
– Distributed locks
– More communication
Why expensive?
• Two-Phase Commit
– Phase 1 (proposal): the coordinator TC sends a proposal (e.g., "commit transaction tid") to participants A, B, C, D; each participant replies with a vote (yes/no)
– Phase 2 (decision): TC sends the commit/abort decision to all participants, which acknowledge with OK
[Figure: 2PC message exchange between coordinator TC and participants A, B, C, D in the two phases]
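To make the message flow concrete, here is a minimal Python sketch of a 2PC coordinator (illustrative only, not from the slides); the participant objects and their prepare/commit/abort methods are hypothetical:

# Minimal two-phase commit coordinator sketch (illustrative only).
# `participants` is a hypothetical list of objects exposing
# prepare(tid) -> bool, commit(tid), and abort(tid).
def two_phase_commit(tid, participants):
    # Phase 1: proposal -- ask every participant to vote yes/no
    votes = [p.prepare(tid) for p in participants]
    decision = all(votes)  # commit only if everyone voted yes
    # Phase 2: decision -- broadcast commit or abort, wait for OKs
    for p in participants:
        if decision:
            p.commit(tid)
        else:
            p.abort(tid)
    return decision  # True = committed, False = aborted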
Throughput of distributed transactions
[Figure: measured throughput as the fraction of distributed transactions increases]
Solution
• Minimize distributed transactions by partitioning
– a good partitioning reduces the number of distributed transactions: more transactions touch only tuples within a single partition and can thus be executed at a single site (see the sketch below)
• Balance the workload across partitions
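As a minimal sketch (not from the paper): assuming each transaction is represented as the set of tuple ids it accesses and `assignment` maps each tuple id to a partition, the number of distributed transactions under a plan can be counted as follows:

def count_distributed(transactions, assignment):
    """Count transactions that touch tuples in more than one partition."""
    distributed = 0
    for accessed_tuples in transactions:
        partitions = {assignment[t] for t in accessed_tuples}
        if len(partitions) > 1:
            distributed += 1
    return distributed

# Example: tuples 1-3 on partition 0, tuples 4-5 on partition 1
assignment = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1}
workload = [{1, 2}, {2, 5}, {4, 5}]
print(count_distributed(workload, assignment))  # -> 1 (only {2, 5} crosses partitions)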
Existing Methods
• Traditional Partitioning
– Total replication
– Round-Robin
– Range
– Hash
• But many n-to-n relationships are hard to partition
– common in the schemas of social networking web sites
Schism
• Input
– Workload trace
• Output
– Partitioning and replication plan
• Five steps
– Data pre-processing
– Creating the graph
– Partitioning the graph
– Explaining the partition
– Final validation
Schism
• Two phases
• Phase 1
– workload-driven, graph-based partition and
replication
– each partition/replication is assigned to one
physical node
• Phase 2
– explaining and validating the partitioning
Creating the graph
• Node: a tuple
• Edge: connects two tuples accessed by the same transaction
• Weight: the number of transactions that access both tuples
• The graph characterizes both the database and the queries over it (see the sketch below)
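A minimal sketch of this construction, assuming the workload trace is given as a list of tuple-id sets (one set per transaction); edge weights count how many transactions access both endpoints:

from collections import Counter
from itertools import combinations

def build_graph(transactions):
    """Build the co-access graph: nodes are tuples, edge weights count
    how many transactions access both endpoints."""
    nodes = set()
    edge_weights = Counter()
    for accessed in transactions:
        nodes.update(accessed)
        # one edge per pair of tuples accessed by the same transaction
        for u, v in combinations(sorted(accessed), 2):
            edge_weights[(u, v)] += 1
    return nodes, edge_weights

nodes, edges = build_graph([{1, 2}, {2, 5}, {1, 3}])
print(edges)  # Counter({(1, 2): 1, (2, 5): 1, (1, 3): 1})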
Creating the graph
Example workload (transaction edges) on the account table:

BEGIN
UPDATE account SET bal=bal-1k WHERE name="carlo";
UPDATE account SET bal=bal+1k WHERE name="evan";
COMMIT

BEGIN
UPDATE account SET bal=60k WHERE id=2;
SELECT * FROM account WHERE id=5;
COMMIT

BEGIN
SELECT * FROM account WHERE id IN {1,3}
ABORT

BEGIN
UPDATE account SET bal=bal+1k WHERE bal < 100k;
COMMIT

account table:
id  name     balance
1   carlo    80k
2   evan     160k
3   sam      129k
4   eugene   29k
5   yang     12k
…   …        …

[Figure: co-access graph over tuples 1-5, edge weights = number of transactions accessing both endpoints; the min-cut splits the tuples into Partition 0 and Partition 1]
Tuple Replication
• Extend the basic graph
• Create n+1 nodes for each tuple (n is the number of transactions that access the tuple)
• Connect the n+1 nodes in a star and assign each star edge a weight equal to the number of transactions that update the tuple (representing the cost of replicating the tuple; see the sketch below)
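A minimal sketch of this expansion, keeping the same workload representation as before; the (tuple_id, i) node naming is my own convention, not the paper's:

def replicate_tuple(tuple_id, accessing_txns, updating_txns):
    """Explode one tuple into n+1 replica nodes connected in a star.
    Star edges carry the replication cost (number of updating txns)."""
    center = (tuple_id, 0)
    replicas = [(tuple_id, i + 1) for i in range(len(accessing_txns))]
    update_cost = len(updating_txns)
    star_edges = [(center, r, update_cost) for r in replicas]
    # each accessing transaction attaches to its own replica node
    txn_attachment = dict(zip(accessing_txns, replicas))
    return [center] + replicas, star_edges, txn_attachment

nodes, edges, attach = replicate_tuple(2, ["t1", "t2", "t3"], updating_txns=["t1"])
print(edges)  # three star edges, each of weight 1 (tuple 2 is updated by one txn)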
Tuple Replication
[Figure: replication-expanded graph for the example of the previous slides]
Tuple Replication
• Allows the partitioning algorithm to balance the costs and benefits of replication
– replicating a tuple across multiple partitions => pay the cost of distributed updates (e.g., tuple 1)
– placing it in a single partition => pay the cost of distributed transactions (e.g., tuple 2)
• The min-cut based algorithm therefore
– does not replicate frequently updated tuples
– replicates rarely updated tuples
Graph Partitioning
• Metis algorithm
– Coarsening
– Initial partitioning
– Refinement
– See www.cs.umn.edu/~metis for more detail
• Goal
– Minimum cut and balanced partitions
• Output
– Lookup table (tuple -> partition mapping)
[Figure: multilevel scheme, coarsening G1 down to Gn, initial partitioning, then refinement while uncoarsening]
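Schism delegates this step to Metis; purely to illustrate the objective (minimum cut with balanced partition sizes), here is a toy greedy bisection sketch, not Metis itself:

def greedy_bisect(nodes, edge_weights, passes=5):
    """Toy 2-way partitioner: start from an even split, then move nodes
    from the larger side whenever the move reduces the weighted cut.
    edge_weights: dict mapping (u, v) -> weight."""
    nodes = sorted(nodes)
    part = {n: i % 2 for i, n in enumerate(nodes)}

    def cut():
        return sum(w for (u, v), w in edge_weights.items() if part[u] != part[v])

    for _ in range(passes):
        for n in nodes:
            size_here = sum(1 for p in part.values() if p == part[n])
            size_other = len(nodes) - size_here
            if size_here <= size_other:
                continue  # moving n would hurt balance
            before = cut()
            part[n] = 1 - part[n]
            if cut() >= before:
                part[n] = 1 - part[n]  # undo moves that do not reduce the cut
    return part, cut()

part, cutsize = greedy_bisect({1, 2, 3, 4, 5},
                              {(1, 2): 2, (2, 5): 1, (1, 3): 1, (4, 5): 1})
print(part, cutsize)  # e.g. tuples {1,2,3} vs {4,5}, cut weight 1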
Graph Size Reduction
• Sampling (see the sketch below)
– transaction level
– tuple level
• Blanket-statement filtering
– discard occasional statements that scan large portions of a table, since (1) they produce many edges with little information and (2) the cost of processing such statements already exceeds the cost of a distributed transaction
• Relevance filtering
– remove tuples that are accessed very rarely, since they carry little information
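A minimal sketch of transaction-level sampling and relevance filtering, reusing the list-of-sets workload representation; the sampling rate and access-count threshold are illustrative, not values from the paper:

import random
from collections import Counter

def sample_transactions(transactions, rate=0.1, seed=0):
    """Transaction-level sampling: keep each transaction with probability `rate`."""
    rng = random.Random(seed)
    return [txn for txn in transactions if rng.random() < rate]

def relevance_filter(transactions, min_accesses=2):
    """Relevance filtering: drop tuples accessed fewer than `min_accesses` times."""
    counts = Counter(t for txn in transactions for t in txn)
    return [{t for t in txn if counts[t] >= min_accesses} for txn in transactions]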
Graph Size Reduction
• Star-shaped replication
– connect replica nodes in a star shape rather than a clique, limiting the number of edges
[Figure: clique vs. star-shaped connection of replica nodes]
• Tuple coalescing
– coalesce tuples that are always accessed together into a single node (see the sketch below)
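A minimal sketch of tuple coalescing: group tuples that are accessed by exactly the same set of transactions into one node (transactions are identified here by their index in the trace):

from collections import defaultdict

def coalesce_tuples(transactions):
    """Return groups of tuples that are always accessed together;
    each group becomes a single node in the graph."""
    access_sets = defaultdict(set)
    for txn_id, accessed in enumerate(transactions):
        for t in sorted(accessed):
            access_sets[t].add(txn_id)
    groups = defaultdict(list)
    for t, txns in access_sets.items():
        groups[frozenset(txns)].append(t)
    return list(groups.values())

print(coalesce_tuples([{1, 2, 3}, {1, 2}, {4}]))
# -> [[1, 2], [3], [4]]  (tuples 1 and 2 are always accessed together)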
Explanation Phase
• Aim to find a compact model that captures the
(tuple, partition) mappings produced by the
partitioning phase
• Use a decision tree to produce understandable rule-based output
– Input: (value, label) pairs
– Output: a tree of predicates over values, leading to leaves with specific labels
– Use: given an unlabeled value, find its label by descending the tree, applying the predicate at each node until reaching a labeled leaf (see the sketch below)
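A minimal sketch of the idea with scikit-learn standing in for whatever tree learner Schism actually uses: fit a shallow tree on (tuple id, partition) pairs and read the range predicates back out:

from sklearn.tree import DecisionTreeClassifier, export_text

# (tuple id, partition) pairs produced by the partitioning phase
ids = [[1], [2], [3], [4], [5]]
partitions = [0, 0, 0, 1, 1]

tree = DecisionTreeClassifier(max_depth=3)  # keep the tree small and readable
tree.fit(ids, partitions)
print(export_text(tree, feature_names=["id"]))  # range predicates over id
print(tree.predict([[6]]))                      # route an unseen tuple id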
Explanation Phase
• Method
– Decision Tree
• Input
– (tuple, partition_set)
• Output
– A set of rules
Example: per-tuple labels produced by the partitioning phase

Tuple id    Partition label
1           R = {0, 1}
2           0
3           0
4           1
5           1

Rules produced by the decision tree:
(id = 1)      -> partitions = {0, 1}
(2 <= id < 4) -> partition = 0
(id >= 4)     -> partition = 1
Explanation Phase
• An explanation is only useful if
– it is based on attributes used frequently in the queries (needed to route transactions to a single site and avoid expensive broadcasts)
– it does not reduce the partitioning quality too much by misclassifying tuples
– it generalizes to additional queries (no overfitting)
Explanation Phase
• Candidate attributes
– Frequently used attributes in WHERE clauses
– Join predicates
• Avoid overfitting via aggressive pruning and cross-validation (see the sketch below)
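A minimal sketch of the pruning/cross-validation idea, again with scikit-learn as a stand-in: constrain tree size (pruning) and check that the explanation still predicts held-out (tuple, partition) pairs:

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

ids = [[i] for i in range(1, 101)]
partitions = [0 if i <= 50 else 1 for i in range(1, 101)]

# min_samples_leaf acts as aggressive pruning: tiny leaves are not allowed
tree = DecisionTreeClassifier(min_samples_leaf=10)
scores = cross_val_score(tree, ids, partitions, cv=5)
print(scores.mean())  # a low score would signal an overfitted explanation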
Validation Phase
• Compare different solutions
– Fine-grained per-tuple graph partitioning
– Range-predicate partitioning by decision tree
– Hash-partitioning on the most frequently used attribute
– Full-table replication
• Choose the one with the smallest number of distributed transactions (see the sketch below)
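A minimal sketch of this comparison; each candidate plan is assumed to be a function mapping a tuple id to its partition, and the winner is the plan with the fewest distributed transactions on the trace:

def pick_best_plan(transactions, plans):
    """plans: dict mapping plan name -> function(tuple_id) -> partition."""
    def distributed_count(plan):
        return sum(1 for txn in transactions
                   if len({plan(t) for t in txn}) > 1)
    return min(plans, key=lambda name: distributed_count(plans[name]))

workload = [{1, 2}, {2, 5}, {4, 5}]
plans = {
    "lookup-table": lambda t: {1: 0, 2: 0, 3: 0, 4: 1, 5: 1}[t],
    "hash": lambda t: t % 2,
}
print(pick_best_plan(workload, plans))  # -> "lookup-table"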
Evaluation
• Schism
• Manual
• Total Replication
• Hash partitioning
Evaluation
• TPC-E (3M nodes; 100M edges)
• TPC-C 50W (2.5M nodes; 65M edges)
• Epinions.com (600k nodes; 5M edges)
Limitations
• Not suitable for insert-heavy workloads
– many data migrations (moving data from one partition to another)
• Not suitable for very large databases
– partitioning cost is too high (a distributed partitioning algorithm may be used instead)