Providing Big Data Applications with Fault-Tolerant Data Migration Across Heterogeneous NoSQL Databases

Marco Scavuzzo, Damian A. Tamburri, Elisabetta Di Nitto
Politecnico di Milano, Italy
BIGDSE '16, May 16th 2016, Austin

NoSQLs and Big Data applications

Highly available Big Data applications need specific storage technologies:
- Distributed File Systems (DFSs), e.g., HDFS, Ceph
- NoSQL databases, e.g., Riak, Cassandra, MongoDB, Neo4j

NoSQL databases are preferred over DFSs because they offer:
- efficient data access (for reads and/or writes);
- concurrent data access;
- adjustable data consistency and integrity policies;
- logic (filtering, grouping, aggregation) in the data layer rather than in the application layer (Hive, Pig, etc.).

NoSQL heterogeneity

- Lack of standard data access interfaces and languages
- Lack of common data models (e.g., data types, secondary indexes, integrity constraints)
- Different architectures, leading to different ways of approaching important problems (e.g., concurrency control, replication, transactions)

Vendor lock-in

"The lack of standards due to most NoSQLs creating their own APIs [..] is going to be a nightmare in due course of time w.r.t. porting applications from one NoSQL to another. Whether it is an open source system or a proprietary one, users will feel locked in." (C. Mohan)

Research objective

Provide a method and supporting architecture to enable fault-tolerant data migration across heterogeneous NoSQL databases for Big Data applications: Hegira4Cloud.

Hegira4Cloud requirements

1. Big Data migration across any NoSQL database and Database as a Service (DaaS)
2. High-performance data migration
3. Fault-tolerant data migration

Hegira4Cloud approach

[Architecture diagram: Source DB -> SRC -> MIGRATION QUEUE -> TWC -> Target DB. The SRC (source reading component) extracts data and converts them to the metamodel format; the TWC (target writing component) converts them back from the metamodel format and writes them to the target database.]

Hegira4Cloud V1

Monolithic architecture; data migration from GAE Datastore to Azure Tables.

| Dataset                   | #1              | #2              | #3                |
| Source size (MB)          | 16              | 64              | 512               |
| # of entities             | 36,940          | 147,758         | 1,182,062         |
| Migration time (sec)      | 1,098 (~18 min) | 4,270 (~71 min) | 34,111 (~568 min) |
| Entity throughput (ent/s) | 33.643          | 34.604          | 34.653            |
| Avg. %CPU usage           | 4.749           | 3.947           | 4.111             |

Improving performance: components decoupling

[Diagram: Source DB -> SRC -> MIGRATION QUEUE -> TWC -> Target DB, with SRC and TWC now deployed as separate components.]

Decoupling the components helps in:
- distributing the computation (conversion to/from the intermediate metamodel);
- isolating possible bottlenecks;
- finding (and solving) errors.

Improving performance: parallelization

[Diagram: Source DB -> multiple SRC instances -> MIGRATION QUEUE -> multiple TWC instances -> Target DB.]

The operations to be executed can be parallelized:
- data extraction (from the source database), provided the data are partitionable;
- data loading (into the target database).

Improving performance: TWC parallelization

Challenges:
- avoid duplicating data (i.e., disjoint pieces of data must each be processed only once);
- avoid thread starvation;
- in case of a fault, already extracted data should not be lost.

Solution: RabbitMQ
- messages are distributed (disjointly) to consumers in round-robin fashion;
- correctly processed messages are acknowledged and removed;
- messages are persisted on disk.

A minimal consumer sketch, and a sketch of the metamodel entities it transports, follow.
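As a rough illustration of this scheme, here is a minimal sketch of a TWC-style consumer built with the RabbitMQ Java client. The queue name, the writeToTargetDb helper, and the surrounding structure are assumptions for illustration, not the actual Hegira4Cloud code.

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DeliverCallback;

public class TwcConsumerSketch {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");
        Connection connection = factory.newConnection();
        final Channel channel = connection.createChannel();

        // durable = true: the queue (and persistent messages) survive broker restarts
        channel.queueDeclare("migration-queue", true, false, false, null);
        // prefetch = 1: the broker dispatches messages round-robin, at most one
        // unacknowledged message per consumer, so TWC threads receive disjoint data
        channel.basicQos(1);

        DeliverCallback onDelivery = (consumerTag, delivery) -> {
            try {
                writeToTargetDb(delivery.getBody()); // hypothetical conversion + write
                // ack only after a successful write: an unacked message is
                // redelivered if this consumer crashes, so extracted data is not lost
                channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false);
            } catch (Exception e) {
                // reject and requeue so another TWC thread can retry the message
                channel.basicNack(delivery.getEnvelope().getDeliveryTag(), false, true);
            }
        };
        // autoAck = false: acknowledgements are sent manually, as above
        channel.basicConsume("migration-queue", false, onDelivery, tag -> {});
    }

    private static void writeToTargetDb(byte[] serializedEntity) { /* ... */ }
}
```

With a prefetch of 1 and manual acknowledgements, each message is processed by exactly one consumer and is redelivered only if that consumer dies before acking, which covers both the disjointness and the no-data-loss requirements above.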
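The payload of each message is an entity in the intermediate metamodel format that SRC and TWC convert to and from. Purely as an illustration (the field names below are assumptions, not the published Hegira4Cloud metamodel), such a pivot entity might look like:

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of a database-independent pivot entity.
public class MetamodelEntity implements Serializable {
    private String key;             // the entity's primary key
    private String partitionGroup;  // grouping info, e.g., mapped to column families
    private final Map<String, Property> properties = new HashMap<>();

    public static class Property implements Serializable {
        String name;
        String type;       // source-agnostic type name, e.g., "String", "Long"
        byte[] value;      // value serialized in a database-independent way
        boolean indexable; // whether the target should build a secondary index
    }
    // getters/setters omitted for brevity
}
```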
Improving performance: SRC parallelization

[Diagram: Source DB -> multiple SRC instances -> MIGRATION QUEUE -> multiple TWC instances -> Target DB.]

Challenges:
- complete knowledge of the stored data is needed in order to partition them;
- partitions should be processed at most once (to avoid duplications).

Let us assume that the data are associated with a unique, incremental primary key (or an indexed property). The key space can then be split into Virtual Data Partitions (VDPs), e.g., VDP1 = keys 1 to 10, VDP2 = keys 11 to 20, VDP3 = keys 21 to 30. References to the VDPs are stored inside a persistent storage.

Addressing faults

Types of (non-trivial) faults:
- database faults;
- component faults;
- network faults;
- connection loss.

On connection loss, not all databases guarantee a unique pointer to the data (e.g., Google Datastore).

Virtual data partitioning

Each VDP moves through three states: not_mig --(migrate)--> under_mig --(finish_mig)--> migrated. A status log, kept in ZooKeeper, records the current state of every partition, e.g.:

| VDPid | Status    |
| 1     | migrated  |
| 2     | under_mig |
| 3     | not_mig   |

Two sketches follow: one for computing VDP boundaries and one for using ZooKeeper as the status log.
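First, a minimal sketch of how VDP boundaries could be derived under the incremental-key assumption above; class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

public class VdpPartitioner {
    public record Vdp(int id, long firstKey, long lastKey) {}

    // Splits the key space [1, lastKey] into fixed-size partitions. With
    // lastKey = 30 and vdpSize = 10 this yields the slide's example:
    // VDP1 = [1,10], VDP2 = [11,20], VDP3 = [21,30].
    public static List<Vdp> partition(long lastKey, long vdpSize) {
        List<Vdp> vdps = new ArrayList<>();
        int id = 1;
        for (long start = 1; start <= lastKey; start += vdpSize, id++) {
            vdps.add(new Vdp(id, start, Math.min(start + vdpSize - 1, lastKey)));
        }
        return vdps;
    }
}
```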
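Second, a sketch of the status log on ZooKeeper. Transitioning a VDP from not_mig to under_mig with a versioned setData acts as a compare-and-swap, so concurrent SRC threads can never claim (and hence migrate) the same partition twice. The znode paths and status strings are assumptions for illustration.

```java
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class VdpStatusLog {
    private final ZooKeeper zk;

    // Assumes /hegira/status/vdp-<id> znodes were pre-created with value "not_mig".
    public VdpStatusLog(ZooKeeper zk) { this.zk = zk; }

    /** Tries to claim the given VDP; returns false if another thread got it first. */
    public boolean tryClaim(int vdpId) throws KeeperException, InterruptedException {
        String path = "/hegira/status/vdp-" + vdpId;
        Stat stat = new Stat();
        byte[] status = zk.getData(path, false, stat);
        if (!"not_mig".equals(new String(status))) {
            return false; // already under migration or migrated
        }
        try {
            // fails with BADVERSION if someone updated the node in between
            zk.setData(path, "under_mig".getBytes(), stat.getVersion());
            return true;
        } catch (KeeperException.BadVersionException e) {
            return false; // lost the race: another SRC thread claimed this VDP
        }
    }

    /** Marks a fully extracted and acknowledged VDP as migrated. */
    public void finishMigration(int vdpId) throws KeeperException, InterruptedException {
        zk.setData("/hegira/status/vdp-" + vdpId, "migrated".getBytes(), -1);
    }
}
```

Because ZooKeeper persists the log, a restarted SRC can resume from the partitions still marked not_mig or under_mig instead of re-extracting everything.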
Hegira4Cloud V2

[Diagram: Source DB -> SRC -> MIGRATION QUEUE -> TWC -> Target DB, with SRC and TWC coordinating through the STATUS LOG.]

Hegira4Cloud V2: Evaluation

| Architecture              | Monolithic | Monolithic | Monolithic | Parallel distributed |
| Dataset                   | #1         | #2         | #3         | #1                   |
| Source size (MB)          | 16         | 64         | 512        | 318,464 (311 GB)     |
| # of entities             | 36,940     | 147,758    | 1,182,062  | ~107M                |
| Migration time (sec)      | 1,098      | 4,270      | 34,111     | 124,867 (~34.5 h)    |
| Entity throughput (ent/s) | 33.643     | 34.604     | 34.653     | 856.41               |
| Avg. %CPU usage           | 4.749      | 3.947      | 4.111      | 49.87                |

The parallel distributed configuration used 1 source reading thread and 40 target writing threads.

Conclusions

- An efficient, fault-tolerant method for data migration
- An architecture supporting data migration across NoSQL databases
- Several databases supported (Azure Tables, Cassandra, Google Datastore, HBase)
- Evaluated on an industrial case study

Future work:
- support online data migrations;
- rigorous tests for assessing data completeness and correctness.

Marco Scavuzzo
PhD student @ Politecnico di Milano
You can find me at: [email protected]

Credits: presentation template by SlidesCarnival