Providing Big Data Applications with
Fault-Tolerant Data Migration Across
Heterogeneous NoSQL Databases
Marco Scavuzzo, Damian A. Tamburri, Elisabetta Di Nitto
Politecnico di Milano, Italy
BIGDSE ‘16 – May 16th 2016, Austin
NoSQLs and Big Data
applications
Highly available Big Data applications need specific storage
technologies:
Distributed File Systems – DFSs (e.g., HDFS, Ceph, etc.)
NoSQL databases (e.g., Riak, Cassandra, MongoDB, Neo4j, etc.)
NoSQLs preferred to DFSs for:
Efficient data access (for reads and/or writes)
Concurrent data access
Adjustable data consistency and integrity policies
Logic (filtering, grouping, aggregation) in the data layer instead of the
application layer (Hive, Pig, etc.)
NoSQLs heterogeneity
Lack of standard data access interfaces and languages
Lack of common data models (e.g., data types, secondary indexes,
integrity constraints, etc.)
Different architectures leading to different ways
of approaching important problems (e.g., concurrency control,
replication, transactions, etc.)
Vendor
lock-in
“The lack of standards due to most
NoSQLs creating their own APIs [..]
is going to be a nightmare in due
course of time w.r.t. porting
applications from one NoSQL to
another. Whether it is an open
source system or a proprietary one,
users will feel locked in.”
C. Mohan
Research objective
Provide a method and supporting architecture to aid
fault-tolerant data migration across heterogeneous
NoSQL databases for Big Data applications
Hegira4Cloud
Hegira4Cloud
requirements
1. Big Data migration across any NoSQL database and Database
as a Service (DaaS)
2. High-performance data migration
3. Fault-tolerant data migration
Hegira4Cloud approach
Conversion to the
Metamodel Format
Migration System Core
Source
DB
SRC
MIGRATION QUEUE
TWC
Target
DB
Conversion from the
Metamodel Format
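The two conversion steps can be sketched in a few lines. This is only an illustrative toy, not the actual Hegira4Cloud metamodel: the field names (`key`, `properties`, a `value`/`type` pair per property, `RowKey` on the target side) are assumptions made for the example.

```python
# Toy sketch of migration through an intermediate metamodel.
# The metamodel decouples source and target: each database only
# needs one converter to, and one converter from, the common format.

def to_metamodel(source_entity):
    """SRC side: convert a source-database entity into the intermediate format,
    keeping an explicit type tag per property."""
    return {
        "key": source_entity["id"],
        "properties": {
            name: {"value": value, "type": type(value).__name__}
            for name, value in source_entity["fields"].items()
        },
    }

def from_metamodel(meta_entity):
    """TWC side: convert an intermediate-format entity into a target row."""
    row = {"RowKey": str(meta_entity["key"])}
    for name, prop in meta_entity["properties"].items():
        row[name] = prop["value"]  # a real TWC would also map the type tags
    return row

entity = {"id": 42, "fields": {"name": "alice", "score": 3}}
meta = to_metamodel(entity)
row = from_metamodel(meta)
print(row)  # {'RowKey': '42', 'name': 'alice', 'score': 3}
```

With this shape, adding support for a new database means writing one converter pair, not one translator per source/target combination.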
Hegira4Cloud V1
Migration System Core
Source
DB
SRC
MIGRATION QUEUE
TWC
Target
DB
Monolithic architecture: data migration from GAE Datastore to Azure Tables
|                             | dataset #1  | dataset #2  | dataset #3    |
| Source size (MB)            | 16          | 64          | 512           |
| # of Entities               | 36940       | 147758      | 1182062       |
| Migration time (sec)        | 1098 (~18m) | 4270 (~71m) | 34111 (~568m) |
| Entities throughput (ent/s) | 33.643      | 34.604      | 34.653        |
| Avg. %CPU usage             | 4.749       | 3.947       | 4.111         |
Improving performance:
components decoupling
Source
DB
SRC
MIGRATION QUEUE
TWC
Target
DB
Components decoupling helps in:
distributing the computation (conversion to/from the intermediate metamodel);
isolating possible bottlenecks;
finding (and solving) errors.
Improving performance:
parallelization
Source
DB
SRC
SRC
SRC
MIGRATION QUEUE
TWC
TWC
TWC
Target
DB
Operations to be executed can be parallelized:
data extraction (from the source database)
data should be partitionable
data load (to the target database)
Improving performance:
TWC parallelization
Source
DB
SRC
SRC
SRC
MIGRATION QUEUE
TWC
TWC
TWC
Target
DB
Challenges:
avoid duplicating data (i.e., process disjoint data only once)
avoid thread starvation
in case of a fault, already extracted data should not be lost
Solution: RabbitMQ
messages distributed (disjointly) in round-robin fashion
messages correctly processed are acknowledged and removed
messages are persisted on disk
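The three queue guarantees listed above can be modeled with a toy in-memory broker. RabbitMQ provides the real round-robin delivery, acknowledgments, and disk persistence; this sketch (all names are invented for illustration) only mimics the delivery/ack/redelivery behavior the TWCs rely on.

```python
from collections import deque
from itertools import cycle

class ToyBroker:
    """In-memory model of the queue semantics Hegira4Cloud relies on:
    round-robin delivery of disjoint messages, with redelivery of any
    message that was never acknowledged (e.g., because a TWC crashed)."""

    def __init__(self, consumers):
        self.queue = deque()
        self.unacked = {}            # msg_id -> message awaiting ack
        self.rr = cycle(consumers)   # round-robin consumer order
        self.next_id = 0

    def publish(self, message):
        self.queue.append((self.next_id, message))
        self.next_id += 1

    def deliver(self):
        """Deliver the next message to the next consumer, round-robin."""
        msg_id, message = self.queue.popleft()
        self.unacked[msg_id] = message
        return next(self.rr), msg_id, message

    def ack(self, msg_id):
        del self.unacked[msg_id]     # correctly processed: removed for good

    def requeue_unacked(self):
        """On consumer failure, unacked messages go back to the queue."""
        for msg_id, message in sorted(self.unacked.items()):
            self.queue.append((msg_id, message))
        self.unacked.clear()

broker = ToyBroker(consumers=["TWC1", "TWC2", "TWC3"])
for entity in ["e1", "e2", "e3"]:
    broker.publish(entity)

consumer, msg_id, msg = broker.deliver()   # e1 goes to TWC1
broker.ack(msg_id)                         # e1 processed, removed
broker.deliver()                           # e2 goes to TWC2, never acked
broker.requeue_unacked()                   # e2 survives the fault
```

Because each message goes to exactly one consumer and unacked messages are redelivered, a TWC crash loses no extracted data and causes no duplicate processing of acknowledged entities.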
Improving performance:
SRC parallelization
Source
DB
SRC
SRC
SRC
MIGRATION QUEUE
TWC
TWC
TWC
Target
DB
Challenges:
complete knowledge of stored data is needed to partition data
partitions should be processed at most once (to avoid duplication)
Improving performance:
SRC parallelization
Source
DB
SRC
SRC
SRC
MIGRATION QUEUE
TWC
TWC
TWC
Target
DB
Let's assume that data are associated with a unique, incremental
primary key (or an indexed property):
the key space is split into virtual data partitions, e.g.
VDP1 (keys 1-10), VDP2 (keys 11-20), VDP3 (keys 21-30).
References to the VDPs are stored inside a persistent storage.
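With a unique incremental key, computing the virtual data partitions reduces to splitting the key range into fixed-size chunks. A minimal sketch (the function and field names are assumptions for illustration), using VDPs of 10 keys as in the example above:

```python
def virtual_data_partitions(min_key, max_key, size):
    """Split the key range [min_key, max_key] into VDPs of `size` keys.
    Each SRC thread can then extract one disjoint VDP at a time."""
    partitions = []
    low = min_key
    vdp_id = 1
    while low <= max_key:
        high = min(low + size - 1, max_key)
        partitions.append({"id": vdp_id, "first_key": low, "last_key": high})
        low = high + 1
        vdp_id += 1
    return partitions

vdps = virtual_data_partitions(1, 30, 10)
print(vdps[0])  # {'id': 1, 'first_key': 1, 'last_key': 10}
```

Since the ranges are disjoint by construction, assigning each VDP to at most one SRC thread guarantees that every entity is extracted exactly once.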
Addressing faults
Source
DB
SRC
MIGRATION QUEUE
TWC
Target
DB
Types of (non-trivial) faults:
Database faults
Component faults
Network faults
Connection loss
On connection loss, not all databases guarantee a unique pointer to the
data (e.g., Google Datastore)
Virtual data partitioning
Source DB keys are grouped into VDPs:
VDP1 (keys 1-10), VDP2 (keys 11-20), VDP3 (keys 21-30)
Partition status lifecycle:
not_mig --(migrate)--> under_mig --(finish_mig)--> migrated
The Status Log, kept in ZooKeeper, records the status of each VDP:
| VDPid | Status    |
| 1     | migrated  |
| 2     | under_mig |
| 3     | not_mig   |
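The partition status lifecycle can be sketched as a small state machine. In Hegira4Cloud the log is persisted in ZooKeeper; this in-memory stand-in (class and event names are assumptions) only illustrates the transitions and the recovery step after a fault.

```python
# Allowed transitions of a VDP in the status log:
#   not_mig --migrate--> under_mig --finish_mig--> migrated
# After a fault, partitions left "under_mig" are reset and re-extracted.

TRANSITIONS = {
    ("not_mig", "migrate"): "under_mig",
    ("under_mig", "finish_mig"): "migrated",
}

class StatusLog:
    """In-memory stand-in for the ZooKeeper-backed status log."""

    def __init__(self, vdp_ids):
        self.status = {vdp: "not_mig" for vdp in vdp_ids}

    def fire(self, vdp, event):
        key = (self.status[vdp], event)
        if key not in TRANSITIONS:
            raise ValueError(f"illegal transition {key} for VDP {vdp}")
        self.status[vdp] = TRANSITIONS[key]

    def recover(self):
        """After a crash, put partially migrated partitions back in line."""
        for vdp, st in self.status.items():
            if st == "under_mig":
                self.status[vdp] = "not_mig"

log = StatusLog([1, 2, 3])
log.fire(1, "migrate"); log.fire(1, "finish_mig")  # VDP1 fully migrated
log.fire(2, "migrate")                              # VDP2 interrupted here
log.recover()                                       # VDP2 back to not_mig
```

Because completed partitions stay `migrated` and interrupted ones return to `not_mig`, a restarted migration resumes from the log without re-migrating finished VDPs or skipping unfinished ones.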
Hegira4Cloud V2
STATUS
LOG
Source
DB
SRC
MIGRATION QUEUE
TWC
Target
DB
Hegira4Cloud V2:
Evaluation
Monolithic architecture vs. parallel distributed architecture:
|                             | Monolithic architecture              | Parallel distributed |
|                             | dataset #1 | dataset #2 | dataset #3 | dataset #1           |
| Source size (MB)            | 16         | 64         | 512        | 318464 (311 GB)      |
| # of Entities               | 36940      | 147758     | 1182062    | ~107M                |
| Migration time (sec)        | 1098       | 4270       | 34111      | 124867 (34½h)        |
| Entities throughput (ent/s) | 33.643     | 34.604     | 34.653     | 856.41               |
| Avg. %CPU usage             | 4.749      | 3.947      | 4.111      | 49.87                |
Parallel configuration: 1 Source Reading Thread, 40 Target Writing Threads
Conclusions
Efficient, fault-tolerant method for data migration
Architecture supporting data migration across NoSQL
databases
Supporting several databases (Azure Tables, Cassandra,
Google Datastore, HBase)
Evaluated on an industrial case study
Future work
Support online data migrations
Rigorous tests for assessing data completeness and
correctness
Marco
Scavuzzo
PhD student @
Politecnico di Milano
You can find me at: [email protected]
Credits
Presentation template by SlidesCarnival