COSC6376 Cloud Computing
Lecture 8: BigTable and Dynamo
Instructor: Weidong Shi (Larry), PhD
Computer Science Department
University of Houston
Outline
• Plan
Project
Next Class
• BigTable and HBase
• Dynamo
Projects
Sample Projects
• Support for video processing using HDFS and MapReduce
• Image processing using cloud
• Security services using cloud
• Web analytics using cloud
• Cloud based MPI
• Novel applications of cloud based storage
• New pricing model
• Cyber physical system with cloud as the backend
• Bioinformatics using MapReduce
Next Week
In-Class Presentation
• Oct 3, next Thursday
• In class
• Each team: 10 minutes
• What should be included in the presentation:
Team
Objectives
Plan of work
Project Proposal
• Due: Oct 8
• Formal project description (at most 4 pages)
Team members
Objective
Tools
Plan of work (tasks and assignments)
• Division of labor
Roadmap
Risk and mitigation strategy
Plan
• Today
BigTable and HBase
Dynamo
• Thursday
Dynamo
Paxos
Reading Assignment
• Due: Thursday
Bigtable: A Distributed Storage System for Structured Data
Fay Chang et al., Google (OSDI 2006)
Global Picture
BigTable
• Distributed multi-level map
• Fault-tolerant, persistent
• Scalable
Thousands of servers
Terabytes of in-memory data
Petabytes of disk-based data
Millions of reads/writes per second, efficient scans
• Self-managing
Servers can be added/removed dynamically
Servers adjust to load imbalance
• Often want to examine data changes over time
E.g. Contents of a web page over multiple crawls
Basic Data Model
• A BigTable is a sparse, distributed, persistent
multi-dimensional sorted map
(row, column, timestamp) -> cell contents
• Good match for most Google applications
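To make the model concrete, here is a toy in-memory analogue of that map in Java. This is an illustration only, not Bigtable's implementation: row and column keys sort lexicographically, and timestamps sort newest-first so a read returns the most recent version.

import java.util.Comparator;
import java.util.TreeMap;

// Toy in-memory analogue of the Bigtable data model:
// (row, column, timestamp) -> cell contents.
// Rows and columns sort lexicographically; timestamps sort
// newest-first, so get() returns the most recent version.
public class ToyTable {
    private final TreeMap<String, TreeMap<String, TreeMap<Long, String>>> rows = new TreeMap<>();

    public void put(String row, String column, long timestamp, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(column, c -> new TreeMap<>(Comparator.reverseOrder()))
            .put(timestamp, value);
    }

    // Most recent version of a cell, or null if the cell is absent.
    public String get(String row, String column) {
        TreeMap<String, TreeMap<Long, String>> columns = rows.get(row);
        if (columns == null) return null;
        TreeMap<Long, String> versions = columns.get(column);
        return (versions == null || versions.isEmpty()) ? null : versions.firstEntry().getValue();
    }
}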
Tablet
• Contains some range of rows of the table
• Built out of multiple SSTables
[Diagram: a tablet covering the row range aardvark..apple, built from multiple SSTables; each SSTable consists of 64 KB blocks plus a block index.]
Chubby
• A persistent and distributed lock service.
• Consists of 5 active replicas; one replica is the
master and serves requests.
• The service is functional when a majority of the replicas
are running and in communication with one
another, i.e., when there is a quorum.
• Implements a name service that consists of
directories and files.
Bigtable and Chubby
• Bigtable uses Chubby to:
Ensure there is at most one active master at a time,
Store the bootstrap location of Bigtable data (Root
tablet),
Discover tablet servers and finalize tablet server
deaths,
Store Bigtable schema information (column family
information),
Store access control list.
• If Chubby becomes unavailable for an extended
period of time, Bigtable becomes unavailable.
Tablet Serving
“Log Structured Merge Trees”
Image Source: Chang et al., OSDI 2006
Tablet Representation
[Diagram: writes are appended to a commit log on GFS and applied to an in-memory write buffer (memtable); reads merge the memtable with immutable SSTables stored on GFS.]
SSTable: immutable on-disk ordered map from string → string
String keys: <row, column, timestamp> triples
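A minimal sketch of this write path in Java, with a local file standing in for the GFS commit log. The names and structure are illustrative, not Bigtable's code:

import java.io.FileWriter;
import java.io.IOException;
import java.util.TreeMap;

// Log-structured write path in miniature: every mutation is first
// appended to a commit log (for recovery), then applied to an
// in-memory sorted buffer (the memtable). A real tablet server
// would also serve reads by merging the memtable with SSTables.
public class MiniTablet {
    private final TreeMap<String, String> memtable = new TreeMap<>();
    private final FileWriter commitLog;

    public MiniTablet(String logPath) throws IOException {
        this.commitLog = new FileWriter(logPath, true); // append-only
    }

    public synchronized void write(String key, String value) throws IOException {
        commitLog.write(key + "\t" + value + "\n");
        commitLog.flush();          // durable before acknowledging
        memtable.put(key, value);   // now visible to reads
    }

    public synchronized String read(String key) {
        return memtable.get(key);   // real reads also consult SSTables
    }
}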
Compactions
• Minor compaction
Converts the memtable into an SSTable
Reduces memory usage and log traffic on restart
• Merging compaction
Reads the contents of a few SSTables and the
memtable, and writes out a new SSTable
Reduces number of SSTables
• Major compaction
Merging compaction that results in only one SSTable
No deletion records, only live data
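Schematically, a merging compaction is a merge of sorted runs in which newer values shadow older ones. A hypothetical sketch in Java (a major compaction would additionally drop deletion entries):

import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Schematic merging compaction: merge several sorted runs (SSTables
// plus the memtable) into one new sorted run. Runs are passed in
// newest-to-oldest order, so the first value seen for a key wins,
// mimicking "the most recent version shadows older ones".
public class Compactor {
    public static TreeMap<String, String> merge(List<SortedMap<String, String>> runsNewestFirst) {
        TreeMap<String, String> out = new TreeMap<>();
        for (SortedMap<String, String> run : runsNewestFirst) {
            for (Map.Entry<String, String> e : run.entrySet()) {
                out.putIfAbsent(e.getKey(), e.getValue()); // a newer run already won
            }
        }
        return out;
    }
}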
Refinements: Locality Groups
• Can group multiple column families into a
locality group
Separate SSTable is created for each locality group in
each tablet.
• Segregating column families that are not
typically accessed together enables more
efficient reads.
In WebTable, page metadata can be in one group
and contents of the page in another group.
Refinements: Compression
• Many opportunities for compression
Similar values in the same row/column at different
timestamps
Similar values in different columns
Similar values across adjacent rows
• Two-pass custom compression scheme
First pass: compress long common strings across a
large window
Second pass: look for repetitions in a small window
• Speed emphasized, but good space reduction
(10-to-1)
Refinements: Bloom Filters
• Read operation has to read from disk when
desired SSTable isn’t in memory
• Reduce number of accesses by specifying a
Bloom filter.
Allows us to ask whether an SSTable might contain data for a
specified row/column pair.
A small amount of memory for Bloom filters drastically
reduces the number of disk seeks for read operations.
In practice, most lookups for non-existent rows
or columns do not need to touch disk.
Bloom Filters
The approximate set membership problem:
• Suppose we have a set S = {s1, s2, ..., sm} drawn from a universe U.
• Represent S in such a way that we can quickly
answer “Is x an element of S?”
• To take as little space as possible, we allow false
positives (i.e., x ∉ S, but we answer yes).
• If x ∈ S, we must answer yes.
Bloom filters
Consist of an array A of n bits and k independent
random hash functions
h1, ..., hk : U → {0, 1, ..., n−1}
• 1. Initially set all bits of the array to 0.
• 2. For each s ∈ S, set A[hi(s)] = 1 for 1 ≤ i ≤ k
(a bit can be set to 1 multiple times; only the first
time has an effect).
• 3. To check if x ∈ S, check whether all locations
A[hi(x)] for 1 ≤ i ≤ k are set to 1.
If not, clearly x ∉ S.
If all A[hi(x)] are set to 1, we assume x ∈ S.
[Diagram: each element of S is hashed k times and each hash location in the bit array is set to 1. To query y, its k locations are checked; if only 1s appear, conclude that y is in S. This may yield a false positive.]
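A minimal Java implementation of this scheme. Deriving the k hashes from two base hashes (double hashing) is an assumption of this sketch; production filters typically use stronger hash functions:

import java.util.BitSet;

// Minimal Bloom filter: n bits and k hash functions derived from the
// element's hashCode via double hashing (an assumption of this toy
// sketch; real filters use stronger hashes such as MurmurHash).
public class BloomFilter {
    private final BitSet bits;
    private final int n, k;

    public BloomFilter(int n, int k) {
        this.bits = new BitSet(n);
        this.n = n;
        this.k = k;
    }

    private int index(String x, int i) {
        int h1 = x.hashCode();
        int h2 = (h1 >>> 16) | 1;               // force the second hash odd
        return Math.floorMod(h1 + i * h2, n);   // i-th derived hash
    }

    public void add(String x) {
        for (int i = 0; i < k; i++) bits.set(index(x, i));
    }

    // false: definitely not in S; true: probably in S (false positives possible).
    public boolean mightContain(String x) {
        for (int i = 0; i < k; i++)
            if (!bits.get(index(x, i))) return false;
        return true;
    }
}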
Bigtable Applications
Application 1: Google Analytics
• Enables webmasters to analyze traffic patterns at
their web sites. Statistics such as:
Number of unique visitors per day and the page views
per URL per day,
Percentage of users that made a purchase given that
they earlier viewed a specific page.
• How?
A small JavaScript program that the webmaster
embeds in their web pages.
Every time the page is visited, the program is
executed.
Program records the following information about each
request:
• User identifier
• The page being fetched
Application 1: Google Analytics
• Two of the Bigtables
Raw click table (~ 200 TB)
• A row for each end-user session.
• Row name includes the website’s name and the time at
which the session was created.
• Sessions that visit the same web site are clustered
and stored in chronological order.
• Compression factor of 6-7.
Summary table (~ 20 TB)
• Stores predefined summaries for each web site.
• Generated from the raw click table by periodically
scheduled MapReduce jobs.
• Each MapReduce job extracts recent session data from the
raw click table.
• Row name includes website’s name and the column
family is the aggregate summaries.
• Compression factor is 2-3.
Application 2: Google Earth & Maps
• Functionality: Pan, view, and
annotate satellite imagery at
different resolution levels.
• One Bigtable stores raw
imagery (~ 70 TB):
Row name is a geographic
segment. Names are chosen
to ensure adjacent geographic
segments are clustered
together.
Column family maintains
sources of data for each
segment.
Application 3: Personalized Search
• Records user queries and clicks across Google
properties.
• Users browse their search histories and request
personalized search results based on
their historical usage patterns.
• One Bigtable:
Row name is userid
A column family is reserved for each action type,
e.g., web queries, clicks.
User profiles are generated using MapReduce.
• These profiles personalize live search results.
Replicated geographically to reduce latency and
increase availability.
HBase is an open-source,
distributed, column-oriented
database built on top of HDFS,
modeled on BigTable!
HBase is ..
• A distributed data store that can scale
horizontally to 1,000s of commodity servers and
petabytes of indexed storage.
• Designed to operate on top of the Hadoop
distributed file system (HDFS) or Kosmos File
System (KFS, aka Cloudstore) for scalability,
fault tolerance, and high availability.
Backdrop
• Started by Chad Walters and Jim Kellerman
• 2006.11
Google releases paper on BigTable
• 2007.2
Initial HBase prototype created as Hadoop contrib.
• 2007.10
First usable HBase
• 2008.1
Hadoop becomes an Apache top-level project and HBase becomes a
subproject
• 2008.10~
HBase 0.18, 0.19 released
Why HBase ?
• HBase is a Bigtable clone.
• It is open source
• It has a good community and promise for the
future
• It is developed on top of and has good
integration with the Hadoop platform, if you are
using Hadoop already.
HBase Is Not …
• No join operators.
• Limited atomicity and transaction support.
HBase supports multiple batched mutations of single
rows only.
Data is unstructured and untyped.
• Not accessed or manipulated via SQL.
Programmatic access via Java, REST, or Thrift APIs.
Scripting via JRuby.
HBase Benefits over an RDBMS
• No real indexes
• Automatic partitioning
• Scales linearly and automatically with new nodes
• Commodity hardware
• Fault tolerance
• Batch processing
Testing
$ hbase shell
> create 'test', 'data'
0 row(s) in 4.3066 seconds
> list
test
1 row(s) in 0.1485 seconds
> put 'test', 'row1', 'data:1', 'value1'
0 row(s) in 0.0454 seconds
> put 'test', 'row2', 'data:2', 'value2'
0 row(s) in 0.0035 seconds
> put 'test', 'row3', 'data:3', 'value3'
0 row(s) in 0.0090 seconds
> scan 'test'
ROW   COLUMN+CELL
row1  column=data:1, timestamp=1240148026198, value=value1
row2  column=data:2, timestamp=1240148040035, value=value2
row3  column=data:3, timestamp=1240148047497, value=value3
3 row(s) in 0.0825 seconds
> disable 'test'
09/04/19 06:40:13 INFO client.HBaseAdmin: Disabled test
0 row(s) in 6.0426 seconds
> drop 'test'
09/04/19 06:40:17 INFO client.HBaseAdmin: Deleted test
0 row(s) in 0.0210 seconds
> list
0 row(s) in 2.0645 seconds
Connecting to HBase
• Java client
get(byte[] row, byte[] column, long timestamp, int versions);
• Non-Java clients
Thrift server hosting HBase client instance
• Sample ruby, c++, & java (via thrift) clients
REST server hosts HBase client
• TableInput/OutputFormat for MapReduce
HBase as MR source or sink
• HBase Shell
./bin/hbase shell YOUR_SCRIPT
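A minimal Java client sketch mirroring the shell session above. It follows the classic HTable-based client API; details such as HBaseConfiguration.create() and Put.add() vary across HBase versions, so treat this as illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

// Java-client equivalent of the shell session above: write one cell
// to table 'test', column family 'data', then read it back.
public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "test");

        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("data"), Bytes.toBytes("1"), Bytes.toBytes("value1"));
        table.put(put);

        Get get = new Get(Bytes.toBytes("row1"));
        Result result = table.get(get);
        System.out.println(Bytes.toString(
            result.getValue(Bytes.toBytes("data"), Bytes.toBytes("1"))));

        table.close();
    }
}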
Dynamo
Motivation
• Build a distributed storage system:
Scale
Simple: key-value
Highly available
Guarantee Service Level Agreements (SLA)
System Assumptions and Requirements
• Query Model:
simple read and write operations to a data
item that is uniquely identified by a key.
• Other assumptions:
the operating environment is assumed
to be non-hostile and there are no security-related
requirements such as authentication and authorization.
Service Level Agreements (SLA)
• An application can deliver its
functionality in a bounded
time: every dependency in the
platform needs to deliver its
functionality with even tighter
bounds.
• Example: a service guaranteeing
that it will provide a response within
300 ms for 99.9% of its requests, for
a peak client load of 500 requests
per second.
Service-oriented architecture of
Amazon’s platform
Design Consideration
• Sacrifice strong consistency for availability
• Conflict resolution is executed during read
instead of write, i.e. “always writeable”.
• Other principles:
Incremental scalability.
Symmetry.
Decentralization.
Heterogeneity.
Summary of techniques used in Dynamo and their advantages
• Partitioning: consistent hashing. Advantage: incremental scalability.
• High availability for writes: vector clocks with reconciliation during reads. Advantage: version size is decoupled from update rates.
• Handling temporary failures: sloppy quorum and hinted handoff. Advantage: provides high availability and durability guarantees when some of the replicas are not available.
• Recovering from permanent failures: anti-entropy using Merkle trees. Advantage: synchronizes divergent replicas in the background.
• Membership and failure detection: gossip-based membership protocol and failure detection. Advantage: preserves symmetry and avoids a centralized registry for storing membership and node liveness information.
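To illustrate the vector-clock row of the table: each node increments its own counter when it coordinates an update, and two versions conflict when neither clock descends from the other. A minimal sketch, not Dynamo's code:

import java.util.HashMap;
import java.util.Map;

// Minimal vector clock: one counter per node that has updated the
// object. Version v1 descends from v2 if every counter in v2 is <=
// the matching counter in v1; if neither descends from the other,
// the versions conflict and must be reconciled at read time.
public class VectorClock {
    private final Map<String, Long> counters = new HashMap<>();

    public void increment(String nodeId) {
        counters.merge(nodeId, 1L, Long::sum);
    }

    public boolean descendsFrom(VectorClock other) {
        for (Map.Entry<String, Long> e : other.counters.entrySet())
            if (counters.getOrDefault(e.getKey(), 0L) < e.getValue()) return false;
        return true;
    }

    public boolean conflictsWith(VectorClock other) {
        return !this.descendsFrom(other) && !other.descendsFrom(this);
    }
}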
Partitioning and Consistent
Hashing
Caches Can Load Balance
[Diagram: a central server distributes items among cache nodes; users fetch items from the caches.]
• Numerous items in a central server.
• Requests can swamp the server.
• Distribute items among cache nodes.
• Clients get items from cache nodes.
• Server gets only one request per item.
Who Caches What?
• Each cache node should hold few items
else cache gets swamped by clients
• Each item should be in few cache nodes
else server gets swamped by caches
and cache invalidations/updates expensive
A Solution: Hashing
[Diagram: items are assigned to caches by a hash function; users use the same hash to compute the cache for an item.]
• Example: y = ax + b (mod n)
• Intuition: assigns items to “random” cache nodes
few items per cache
• Easy to compute which cache holds an item
Problem: Adding Cache Nodes
• Suppose a new cache node arrives.
• How does it affect the hash function?
• Natural change:
y = ax + b (mod n+1)
• Problem: changes the bucket for every item
every cache node will be flushed
servers get swamped with new requests
Goal: when a bucket is added, few items should move.
Solution: Consistent Hashing
• Use a standard hash function to map cache
nodes and items to points in the unit interval.
“random” points spread uniformly
• Item assigned to the nearest cache node (bucket).
[Diagram: items and cache nodes (buckets) placed on the unit interval.]
Computation is as easy as a standard hash function.
Properties
• All buckets get roughly the same number of items
(like standard hashing).
• When the kth bucket is added, only a 1/k fraction
of items move,
and only from a few caches.
When a cache node is added, minimal reshuffling
of cached items is required (see the sketch below).
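A compact Java sketch of a consistent-hash ring using a sorted map as the circle; the MD5-based hash and string node names are illustrative choices:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

// Consistent hashing: nodes and keys hash onto the same circular
// space; a key is owned by the first node clockwise from its hash.
// Adding or removing a node only moves the keys in its arc.
public class HashRing {
    protected final TreeMap<Long, String> ring = new TreeMap<>();

    protected long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xffL);
            return h;
        } catch (Exception e) { throw new RuntimeException(e); }
    }

    public void addNode(String node)    { ring.put(hash(node), node); }
    public void removeNode(String node) { ring.remove(hash(node)); }

    // First node clockwise from the key's position (wrapping around).
    public String nodeFor(String key) {
        SortedMap<Long, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.firstEntry().getValue()
                              : tail.get(tail.firstKey());
    }
}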
Consistent Hashing
Partition using consistent hashing:
– Keys hash to a point on a fixed circular space.
– The ring is partitioned into a set of ordered slots, and
servers and keys are hashed over these slots.
Nodes take positions on the circle. Suppose A, B, and D exist:
– B is responsible for the AB range.
– D is responsible for the BD range.
– A is responsible for the DA range.
C joins:
– B and D split ranges; C gets the BC range from D.
[Diagram: nodes placed around the hash ring.]
Virtual Nodes
“Virtual Nodes”: Each node can be
responsible for more than one virtual
node.
• If a node becomes unavailable the
load handled by this node is evenly
dispersed across the remaining
available nodes.
• When a node becomes available
again, the newly available node
accepts a roughly equivalent
amount of load from each of the
other available nodes.
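Virtual nodes drop directly into the HashRing sketch above: each physical server inserts several tokens, so its load scatters around the circle. A hypothetical extension (the "#" token suffix is an assumption of this sketch):

// Extension of the HashRing sketch above: each physical server claims
// several positions ("tokens") on the ring, so when it leaves, its load
// is dispersed across many successors instead of a single neighbor.
public class VirtualNodeRing extends HashRing {
    private final int tokensPerServer;

    public VirtualNodeRing(int tokensPerServer) {
        this.tokensPerServer = tokensPerServer;
    }

    public void addServer(String server) {
        for (int i = 0; i < tokensPerServer; i++)
            addNode(server + "#" + i);   // each token hashes to its own position
    }

    public void removeServer(String server) {
        for (int i = 0; i < tokensPerServer; i++)
            removeNode(server + "#" + i);
    }

    // Owning physical server for a key (strip the token suffix).
    public String serverFor(String key) {
        return nodeFor(key).split("#")[0];
    }
}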
Replication
• Each data item is replicated at N hosts.
• “Preference list”: the list of nodes responsible for
storing a particular key (a sketch follows).
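Combining the pieces, the preference list can be computed by walking the ring clockwise from the key's position until N distinct physical nodes are found. A method one could add inside the HashRing sketch above (java.util imports assumed; illustrative, not Dynamo's code):

// Walk clockwise from the key's position and collect the first
// n distinct physical servers: the key's preference list.
public List<String> preferenceList(String key, int n) {
    LinkedHashSet<String> servers = new LinkedHashSet<>();
    long start = hash(key);
    for (String token : ring.tailMap(start).values()) {   // clockwise to ring end
        servers.add(token.split("#")[0]);                 // strip virtual-node suffix
        if (servers.size() == n) return new ArrayList<>(servers);
    }
    for (String token : ring.headMap(start).values()) {   // wrap around
        servers.add(token.split("#")[0]);
        if (servers.size() == n) break;
    }
    return new ArrayList<>(servers);
}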