Cloud Computing and Architecture
NoSQL and MongoDB
CS@AU Henrik Bærbak Christensen

Big Data
• New services need to store and query very large data sets
  – Google, Amazon, Twitter, Netflix, …
• Another issue is scalability
  – Hyper-growth in the number of users and traffic
Do not want to go there!

Can an RDBMS do it?
• Performance:
  – Time
• Queries do not scale linearly in the size of the data set
  – 10,000 rows: ~ 1 second
  – 1 billion rows: ~ 24 hours
• Answer: No [Jacobs (2009)]

Why do RDBMSs become slow?
• Jacobs’ paper explains the details.
• It boils down to:
  – The memory hierarchy
    • Cache, RAM, Disk, (Tape)
    • Each step has a big performance penalty
  – The reading pattern
    • Sequential read is faster than random read
    • Tape: obvious, but this is true even for RAM!

Normalization
• It becomes even worse due to the classic RDBMS virtue of normalization:
• A user table and a transaction table
  – Combining them requires a join, which means random reads

Jacobs’ statement

Denormalization technique
• Denormalization:
  – Make aggregate objects
  – Allows much faster access
  – Liability: much more space intensive
• And!
  – Store data in the sequence in which they are to be queried…

NoSQL Databases
A new take on storing and querying big data

Key features
• NoSQL: “Not Only SQL” / Not Relational
  – Horizontal scaling of simple operations over many servers
  – Replication and partitioning of data over many servers
  – Simple call level interface
  – Weaker concurrency model than ACID
  – Efficient use of RAM and distributed indexes for storage
  – Ability to dynamically add new attributes to records
• Architectural Drivers:
  – Performance
  – Scalability
[Cattell, 2010]

Clarifications
• NoSQL focuses on
  – Simple operations
    • Key lookup, read/write of one or a few records
    • Opposite: complex joins over many tables (SQL)
    • (NoSQL databases generally denormalize and thus do not support joins)
  – Horizontal scaling
    • Many servers that share neither RAM nor disk
    • Commodity hardware
      – Cheap but more prone to failures

NoSQL DB Types

Basic types
• Four types

Key-Value stores
• Basically a Java HashMap<K,V>
• Typically RAM based, with periodic flushing to disk
• And
  – If you lose the key, you are lost

Document
• Stores “documents”
  – MongoDB: JSON objects.
  – Stronger queries, also in document contents
  – Schema: Any JSON object may be stored!
  – Atomic updates, otherwise no concurrency control
• Supports
  – Master-slave replication, automatic failover and recovery
  – Automatic sharding
    • Range-based, on shard key (like zip-code, CPR, etc.)

CAP Theorem

Reviewing ACID
• Basic RDBMS teaching covers ACID
• Atomicity
  – Transaction: all or none succeed
• Consistency
  – DB is in a valid state before and after a transaction
• Isolation
  – N transactions executed concurrently = N executed in sequence
• Durability
  – Once a transaction has been committed, it will remain so (even in the event of power loss or crash)

CAP
• Eric Brewer: you can only get two of the three:
• Consistency
  – A set of operations appears to have occurred all at once (client view)
• Availability
  – Operations will terminate in an intended response
• Partition tolerance
  – Operations will complete, even if components are unavailable

Horizontal Scaling: P taken
• We have already taken P, so we have to relax either
  – Consistency
  – Availability
• RDBMSs prefer consistency over availability
• NoSQL prefers availability over consistency
  – replacing it with eventual consistency

Achieving ACID
• Two-phase Commit
  1. All partitions pre-commit and report the result to the master
  2. If all succeed, the master tells each to commit; else roll back
• Guarantees consistency, but availability suffers
• Example
  – Two partitions, each with 99.9% availability
    • => 0.999^2 ≈ 0.998 = 99.8% (+43 min of downtime every month)
  – Five partitions:
    • => 0.999^5 ≈ 0.995 = 99.5% (about 3.6 hours of downtime every month in all)

BASE
• Replace ACID with BASE
• BA: Basically Available
• S: Soft state
• E: Eventually consistent
• Availability is achieved by partial failures not leading to system failures
  – In two-phase commit, what would the master do if one partition does not respond?

Eventual Consistency
• So what does this mean?
  – Upon write, an immediate read may retrieve the old value, or not see the newest added item!
    • Why? It gets data from a replica that is not yet updated…
  – Eventual consistency:
    • Given a sufficiently long period of time with no updates, all replicas become consistent, and thus all reads consistently return the same data…
    • The system is always available, but state is ‘soft’ / cached

Discussion
• Web applications
  – Is it a problem that a Facebook update takes some minutes to appear at my friends?

Summary
• RDBMSs have issues with ‘big data’
  – Ignores the physical laws of hardware (random read)
  – RDB model requires joins (random read)
• NoSQL
  – Class of alternative DB models and implementations
  – Two main classes
    • Key-value stores
    • Document stores
• CAP
  – You can only get two of three
  – NoSQL chooses A, and sacrifices C for eventual consistency
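
Example: denormalization into aggregate objects
The sketch below illustrates the "Denormalization technique" slide in plain Python. The user and transaction records, and the names join_lookup and denormalize, are invented for illustration; they are not from Jacobs' paper or from any database API. It shows the trade-off stated on the slide: the aggregate form answers a lookup with one key access, at the cost of storing the data a second time.

```python
# Normalized form: two "tables". Reading one user's transactions needs a join.
users = {1: {"name": "Alice"}, 2: {"name": "Bob"}}
transactions = [
    {"user_id": 1, "amount": 120},
    {"user_id": 2, "amount": 75},
    {"user_id": 1, "amount": 30},
]


def join_lookup(user_id):
    """Join at query time: scan the transaction table for matching rows."""
    amounts = [t["amount"] for t in transactions if t["user_id"] == user_id]
    return {"name": users[user_id]["name"], "transactions": amounts}


def denormalize(users, transactions):
    """Build one aggregate document per user with its transactions embedded."""
    documents = {
        uid: {"name": user["name"], "transactions": []}
        for uid, user in users.items()
    }
    for t in transactions:
        documents[t["user_id"]]["transactions"].append(t["amount"])
    return documents


# Denormalized form: a single key lookup returns the whole aggregate.
documents = denormalize(users, transactions)
print(documents[1])
```

Both forms return the same answer; the difference is that the aggregate is already stored in the shape in which it is queried.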
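
Example: availability under two-phase commit
The arithmetic on the "Achieving ACID" slide can be checked with a few lines of Python. This is a minimal sketch under two assumptions that the slide does not state explicitly: partitions fail independently, and a month has 30 days. The function names are illustrative only.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # assumption: a 30-day month


def combined_availability(per_partition, partitions):
    """A two-phase commit succeeds only if every partition is up,
    so the availabilities multiply (assuming independent failures)."""
    return per_partition ** partitions


def downtime_minutes(availability):
    """Expected downtime per month for a given availability."""
    return (1.0 - availability) * MINUTES_PER_MONTH


for n in (1, 2, 5):
    a = combined_availability(0.999, n)
    print(f"{n} partition(s): {a:.3%} available, "
          f"{downtime_minutes(a):.1f} min down per month")
```

The extra downtime of two partitions compared with one is the difference between the second and first rows of the output.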
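
Example: a stale read under eventual consistency
The behaviour described on the "Eventual Consistency" slide can be imitated with a toy in-memory store. This is a sketch, not how MongoDB or any other product implements replication: the class EventuallyConsistentStore and its methods are hypothetical, and propagation is triggered by hand instead of running in the background.

```python
class EventuallyConsistentStore:
    """Toy store: writes go to a primary, reads are served by replicas."""

    def __init__(self, n_replicas=2):
        self.primary = {}
        self.replicas = [{} for _ in range(n_replicas)]
        self.pending = []  # writes not yet copied to the replicas

    def write(self, key, value):
        # The write is acknowledged as soon as the primary has it.
        self.primary[key] = value
        self.pending.append((key, value))

    def read(self, key, replica=0):
        # Reads never block, but may return an old value (soft state).
        return self.replicas[replica].get(key)

    def propagate(self):
        # Stands in for the "sufficiently long period with no updates".
        for key, value in self.pending:
            for replica in self.replicas:
                replica[key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("status", "old")
store.propagate()
store.write("status", "new")
stale = store.read("status")   # replica has not seen the second write yet
store.propagate()
fresh = store.read("status")   # all replicas now agree
print(stale, fresh)
```

The store stays available for reads throughout; what is given up is the guarantee that a read immediately after a write sees that write.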