L4-1 NoSQLAndMongo

Cloud Computing
and Architecture
NoSQL and MongoDB
BigData
• New services need to store and query very large data sets
– Google, Amazon, Twitter, Netflix, …
• Another issue is scalability
– Hyper-growth in the number of users and in traffic
Do not want
to go there!
CS@AU
Henrik Bærbak Christensen
Can an RDBMS do it?
• Performance:
– Time
• Queries do not scale linearly in the size of the data set
– 10,000 rows: ~ 1 second
– 1 billion rows: ~ 24 hours
• Answer: No
[Jacobs (2009)]
Why do RDBMs become slow?
• Jacobs’ paper explains the details.
• It boils down to:
– The memory hierarchy
• Cache, RAM, Disk, (Tape)
• Each step has a big performance penalty
– The reading pattern
• Sequential reads are faster than random reads
• Tape: obvious, but even for RAM this is true!
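The reading-pattern point can be tried out with a small experiment. The sketch below is only an illustration, not taken from Jacobs' paper: it sums the same list once in index order and once in a shuffled order. The function names and the list size are assumptions, and in an interpreted language the gap is smaller than in compiled code and varies from machine to machine.

```python
import random
import time


def timed_sum(values, order):
    """Sum values[i] for each i in order; return (total, elapsed seconds)."""
    start = time.perf_counter()
    total = 0
    for i in order:
        total += values[i]
    return total, time.perf_counter() - start


def compare_access_patterns(n=1_000_000, seed=42):
    """Visit the same n elements sequentially and in random order."""
    values = list(range(n))
    sequential = list(range(n))
    shuffled = sequential[:]
    random.Random(seed).shuffle(shuffled)  # same indexes, random visiting order
    seq_total, seq_time = timed_sum(values, sequential)
    rnd_total, rnd_time = timed_sum(values, shuffled)
    return seq_total, rnd_total, seq_time, rnd_time


if __name__ == "__main__":
    seq_total, rnd_total, seq_time, rnd_time = compare_access_patterns()
    print(f"sequential: {seq_time:.3f} s, random: {rnd_time:.3f} s")
```

Both runs do exactly the same amount of work; only the order of memory accesses differs.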
Normalization
• It becomes even worse due to the classic RDBMS virtue of normalization:
– Example: a user table and a transaction table must be joined to list a user's transactions
Jacobs’ statement
Denormalization technique
• Denormalization:
– Make aggregate objects
– Allows much faster access
– Liability: much more space intensive
• And!
– Store data in the sequence in which it is to be queried…
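To make the contrast concrete, here is a minimal sketch using plain Python dictionaries as stand-ins for tables. The user and transaction data, and all function names, are invented for illustration. The normalized form needs a join-like scan to answer "which transactions belong to this user?", while the denormalized aggregate answers it with a single lookup, at the cost of storing the user's name together with the amounts.

```python
# Normalized: two "tables" linked by user_id.
users = {1: {"name": "Alice"}, 2: {"name": "Bob"}}
transactions = [
    {"user_id": 1, "amount": 120},
    {"user_id": 2, "amount": 35},
    {"user_id": 1, "amount": 60},
]


def lookup_normalized(user_id):
    """Join user and transaction rows at query time (scans all transactions)."""
    amounts = [t["amount"] for t in transactions if t["user_id"] == user_id]
    return {"name": users[user_id]["name"], "amounts": amounts}


def denormalize():
    """Build one aggregate object per user, with its transactions embedded."""
    aggregates = {uid: {"name": u["name"], "amounts": []} for uid, u in users.items()}
    for t in transactions:
        aggregates[t["user_id"]]["amounts"].append(t["amount"])
    return aggregates


def lookup_denormalized(aggregates, user_id):
    """One key lookup; the data is already stored in the shape it is queried."""
    return aggregates[user_id]
```

Both lookups return the same answer; they differ in how much data has to be touched to produce it.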
NoSQL Databases
A new take on storing and querying big data
Key features
• NoSQL: “Not Only SQL” / Not Relational
– Horizontal scaling of simple operations over many servers
– Replication and partitioning of data over many servers
– Simple call-level interface
– Weaker concurrency model than ACID
– Efficient use of RAM and distributed indexes for storage
– Ability to dynamically add new attributes to records
• Architectural Drivers:
– Performance
– Scalability
[Cattell, 2010]
Clarifications
• NoSQL focuses on
– Simple operations
• Key lookup, read/write of one or a few records
• Opposite: complex joins over many tables (SQL)
• (NoSQL databases generally denormalize and thus do not support joins)
– Horizontal scaling
• Many servers that share neither RAM nor disk
• Commodity hardware
– Cheap but more prone to failures
NoSQL DB Types
Basic types
• Four types
Key-Value stores
• Basically a Java HashMap<K,V>()
• Typically RAM based, with periodic flushing to disk
• And
– If you lose the key, you are lost
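A minimal sketch of the idea, in Python rather than Java: a dictionary held in RAM that is written to disk every few writes. The class name, the JSON file format and the "flush after N writes" policy are all assumptions made for illustration; real key-value stores use their own formats and typically flush on a timer or via an append-only log.

```python
import json
import os
import tempfile


class TinyKeyValueStore:
    """RAM-based map with periodic flushing to a JSON file (string keys only)."""

    def __init__(self, path, flush_every=3):
        self._path = path
        self._flush_every = flush_every
        self._writes_since_flush = 0
        self._data = {}
        if os.path.exists(path):  # recover whatever was flushed earlier
            with open(path, encoding="utf-8") as f:
                self._data = json.load(f)

    def put(self, key, value):
        self._data[key] = value
        self._writes_since_flush += 1
        if self._writes_since_flush >= self._flush_every:
            self.flush()

    def get(self, key, default=None):
        return self._data.get(key, default)

    def flush(self):
        with open(self._path, "w", encoding="utf-8") as f:
            json.dump(self._data, f)
        self._writes_since_flush = 0


def demo():
    """Write three values, then 'restart' and see what survived on disk."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "store.json")
        store = TinyKeyValueStore(path, flush_every=2)
        store.put("user:1", "Alice")
        store.put("user:2", "Bob")    # second write triggers a flush
        store.put("user:3", "Carol")  # so far only in RAM
        reopened = TinyKeyValueStore(path)  # simulates a restart after a crash
        return reopened.get("user:2"), reopened.get("user:3")
```

In `demo()`, the value written after the last flush is gone after the simulated restart, which is the price paid for RAM-speed writes.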
Document
• Stores “documents”
– MongoDB: JSON objects
– Stronger queries, also in document contents
– Schema: any JSON object may be stored!
– Atomic updates, otherwise no concurrency control
• Supports
– Master-slave replication, automatic failover and recovery
– Automatic sharding
• Range-based, on shard key (like zip code, CPR, etc.)
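Range-based sharding can be sketched without a database. The class below is not MongoDB's API; it is a hypothetical in-memory stand-in in which fixed split points on a shard key (here a zip code) decide which shard holds a document. In a real system the ranges are chosen and rebalanced automatically.

```python
import bisect


class RangeShardedCollection:
    """Routes each document to a shard by the range its shard key falls in."""

    def __init__(self, split_points, shard_key):
        # n sorted split points define n + 1 key ranges, one per shard.
        self._split_points = sorted(split_points)
        self._shard_key = shard_key
        self.shards = [[] for _ in range(len(self._split_points) + 1)]

    def shard_for(self, key_value):
        """Index of the shard responsible for key_value."""
        return bisect.bisect_right(self._split_points, key_value)

    def insert(self, document):
        # Documents are schema-free dicts; only the shard key is required.
        self.shards[self.shard_for(document[self._shard_key])].append(document)

    def find(self, key_value):
        """A query on the shard key only has to visit a single shard."""
        shard = self.shards[self.shard_for(key_value)]
        return [d for d in shard if d[self._shard_key] == key_value]


collection = RangeShardedCollection(split_points=[4000, 8000], shard_key="zip")
collection.insert({"zip": 2100, "name": "Ann"})
collection.insert({"zip": 8200, "name": "Bo", "phone": "12345678"})
```

Note that the two documents have different fields; nothing but the shard key is assumed about their structure.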
CAP Theorem
Reviewing ACID
• Basic RDBMS teaching covers ACID
• Atomicity
– Transaction: all or none succeed
• Consistency
– DB is in a valid state before and after a transaction
• Isolation
– N transactions executed concurrently = N executed in sequence
• Durability
– Once a transaction has been committed, it will remain so (even in the event of power loss or a crash)
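Atomicity and consistency can be seen in a few lines with Python's built-in `sqlite3` module. The account table, the names and the amounts are invented for illustration. The transfer credits the receiver first and then debits the sender; if the debit would violate the `CHECK` constraint, the whole transaction is rolled back, so the credit disappears as well.

```python
import sqlite3


def open_bank():
    """In-memory database with a rule that balances may never go negative."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE account ("
        "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
    )
    conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 20)])
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Move amount from src to dst as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:  # the CHECK constraint was violated
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM account"))
```

Calling `transfer(conn, "bob", "alice", 50)` on a fresh database fails and leaves both balances untouched, even though the first of the two updates had already been executed.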
CAP
• Eric Brewer: you can only get two of the three:
• Consistency
– A set of operations appears to have occurred all at once (client view)
• Availability
– Operations will terminate with the intended response
• Partition tolerance
– Operations will complete, even if components are unavailable
Horizontal Scaling: P taken
• We have already taken P, so we have to relax either
– Consistency
– Availability
• RDBMSs prefer consistency over availability
• NoSQL databases prefer availability over consistency
– replacing it with eventual consistency
Achieving ACID
• Two-phase commit
1. All partitions pre-commit, report result to master
2. If success, master tells each to commit; else roll back
• Guarantees consistency, but availability suffers
• Example
– Two partitions, each with 99.9% availability
• => 99.9% × 99.9% = 99.8% (+43 min down every month)
– Five partitions:
• 99.5% (36 hours downtime in all)
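The arithmetic behind the example can be checked with a short sketch. It assumes that partitions fail independently and that a transaction needs all of them to be up, so the availabilities multiply. The function names and the 30-day month are assumptions; the slide does not state the period behind the five-partition downtime figure, so the code only reports downtime for a period you pass in.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # assumed 30-day month


def combined_availability(per_partition, partitions):
    """All partitions must be up at once, so their availabilities multiply."""
    return per_partition ** partitions


def downtime_minutes(availability, period_minutes=MINUTES_PER_MONTH):
    """Expected downtime within the period for the given availability."""
    return (1.0 - availability) * period_minutes


if __name__ == "__main__":
    for n in (1, 2, 5):
        a = combined_availability(0.999, n)
        print(f"{n} partition(s): {a * 100:.1f}% available, "
              f"{downtime_minutes(a):.0f} min down per month")
```

Going from one to two partitions roughly doubles the expected downtime, which is where the extra 43 minutes per month come from.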
BASE
• Replace ACID with BASE
• BA: Basically Available
• S: Soft state
• E: Eventually consistent
• Availability is achieved by partial failures not leading to system failures
– In two-phase commit, what would the master do if one partition does not respond?
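One possible answer to the question above, sketched as a toy simulation: the master treats a missing reply as a "no" vote and aborts the whole transaction. This is an assumption about a common policy, not something the slides prescribe, and the `Partition` and `Vote` types are hypothetical. It shows why a single unreachable partition makes the whole operation unavailable.

```python
from enum import Enum


class Vote(Enum):
    YES = "yes"
    NO = "no"
    NO_RESPONSE = "no response"  # stands in for a timeout


class Partition:
    def __init__(self, name, vote=Vote.YES):
        self.name = name
        self._vote = vote  # how this partition will answer the pre-commit
        self.state = "idle"

    def pre_commit(self):
        if self._vote is Vote.YES:
            self.state = "prepared"
        return self._vote

    def commit(self):
        self.state = "committed"

    def roll_back(self):
        self.state = "rolled back"


def two_phase_commit(partitions):
    """Return True if every partition committed, False if the master aborted."""
    # Phase 1: every partition pre-commits and reports to the master.
    votes = [p.pre_commit() for p in partitions]
    # Phase 2: commit only on unanimous success.
    if all(v is Vote.YES for v in votes):
        for p in partitions:
            p.commit()
        return True
    # Otherwise abort; only partitions that answered can be told to roll back.
    for p, v in zip(partitions, votes):
        if v is not Vote.NO_RESPONSE:
            p.roll_back()
    return False
```

With one silent partition the healthy ones are rolled back too, so no work gets done even though most of the system is up.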
Eventual Consistency
• So what does this mean?
– Upon a write, an immediate read may retrieve the old value, or not see the newest added item!
• Why? The read gets data from a replica that is not yet updated…
– Eventual consistency:
• Given a sufficiently long period of time with no updates, all replicas become consistent, and thus all reads return the same data…
• System always available, but state is ‘soft’ / cached
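The stale-read behaviour can be reproduced with a toy model. The sketch below assumes one replica that accepts writes immediately and others that receive them later through an explicit `propagate()` call; the class and method names are invented, and real systems propagate in the background rather than on request.

```python
class EventuallyConsistentStore:
    """Writes land on replica 0 at once and reach the others only later."""

    def __init__(self, replica_count=3):
        self.replicas = [{} for _ in range(replica_count)]
        self._pending = []  # (replica index, key, value) not yet applied

    def write(self, key, value):
        self.replicas[0][key] = value
        for index in range(1, len(self.replicas)):
            self._pending.append((index, key, value))

    def read(self, key, replica_index):
        """Always answers, but possibly with an old value."""
        return self.replicas[replica_index].get(key)

    def propagate(self):
        """Apply all outstanding updates in order; replicas then agree."""
        for index, key, value in self._pending:
            self.replicas[index][key] = value
        self._pending.clear()


store = EventuallyConsistentStore()
store.write("status", "old")
store.propagate()
store.write("status", "new")
stale = store.read("status", replica_index=2)  # replica 2 has not caught up yet
store.propagate()
fresh = store.read("status", replica_index=2)
```

Every read succeeds, which is the availability side of the trade; what is given up is the guarantee that the answer is the latest one.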
Discussion
• Web applications
– Is it a problem that a Facebook update takes some minutes to appear for my friends?
Summary
• RDBMSs have issues with ‘big data’
– They ignore the physical laws of hardware (random read)
– The relational model requires joins (random read)
• NoSQL
– Class of alternative DB models and implementations
– Two main classes
• Key-value stores
• Document stores
• CAP
– You can only get two of three
– NoSQL chooses A, and sacrifices C for eventual consistency