Web Systems and Algorithms
Google
Chris Brooks
Department of Computer Science
University of San Francisco
Department of Computer Science — University of San Francisco – p.1/??
Cloud Computing at Google
Google has developed a layered system to handle
web-scale applications:
Google File System
Bigtable
MapReduce
Google File System
What are the primary design issues surrounding GFS?
Design Issues
Commodity hardware: component failures are the rule, not the exception
Huge files
Files tend to be written once, then either appended to or
streamed (read sequentially)
Random writes are rare
What sorts of applications would have this behavior?
Multiple clients may simultaneously write (usually append) to a file
API and application design should happen in tandem
Sustained bandwidth more important than latency
Architecture
Single master
Many chunkservers
Many clients
Files are divided into 64MB chunks.
Each chunk is replicated on several chunkservers (three by default).
Master
What is the master’s role?
Maintain metadata: namespace, access control, mapping of
files to chunks, chunk locations.
Refer clients to chunkservers.
Manage chunk leases, garbage collection, and chunk
migration.
Chunkserver
What is the chunkserver’s role?
Store chunks on local disk and serve them to clients
Control flow
What is the typical order of operations for a client that
wants to read a file?
Client sends filename and chunk index (computed from the
byte offset) to master.
Master returns chunk handle and replica locations.
Client chooses a replica (usually the closest) and requests a
byte range within the chunk.
Master is not needed for further data exchange.
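The read path can be sketched in a few lines of Python. This is a toy in-memory model, not GFS code: the class names, the `/logs/web` file, the handle `h1`, and the server names are all made up for illustration, and reads that span a chunk boundary are ignored.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks, as on the Architecture slide


class Master:
    """Holds metadata only: (filename, chunk index) -> (handle, replicas)."""

    def __init__(self):
        self.chunk_table = {}

    def lookup(self, filename, chunk_index):
        return self.chunk_table[(filename, chunk_index)]


class Chunkserver:
    """Stores chunk contents keyed by chunk handle."""

    def __init__(self):
        self.chunks = {}

    def read(self, handle, start, length):
        return self.chunks[handle][start:start + length]


def client_read(master, servers, filename, offset, length):
    chunk_index = offset // CHUNK_SIZE      # which chunk holds this offset
    handle, replicas = master.lookup(filename, chunk_index)
    server = servers[replicas[0]]           # any replica would do
    # From here on the master is out of the picture.
    return server.read(handle, offset % CHUNK_SIZE, length)


# Hypothetical setup: chunk 1 of one file, stored on two chunkservers.
master = Master()
servers = {"cs1": Chunkserver(), "cs2": Chunkserver()}
master.chunk_table[("/logs/web", 1)] = ("h1", ["cs1", "cs2"])
for server in servers.values():
    server.chunks["h1"] = b"hello world"

data = client_read(master, servers, "/logs/web", CHUNK_SIZE + 6, 5)
print(data)
```

Note that the master only ever answers a small metadata question; the bytes themselves flow between client and chunkserver.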
Advantages
What are some advantages of this approach?
Simplicity of master.
Metadata held in memory. No need to handle chunk
access.
Easy to handle failure; client just requests the chunk from
another replica.
Persistence
Is a single master a potential point of failure?
How can a master recover from a crash?
Master keeps all metadata structures in memory.
Each metadata change is recorded in an operation log.
Master also periodically checkpoints. On failure, reload
from the latest checkpoint and replay the log.
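Checkpoint-plus-log recovery can be sketched as follows. This is a simplified illustration under stated assumptions: the `MetadataStore` class and its operation names are invented here, and the log and checkpoint are plain attributes standing in for data that would really live on disk.

```python
import copy


class MetadataStore:
    def __init__(self):
        self.files = {}       # in-memory state: filename -> list of chunk handles
        self.log = []         # operations recorded since the last checkpoint
        self.checkpoint = {}  # snapshot of self.files

    def apply(self, op):
        kind, filename, handle = op
        if kind == "create":
            self.files[filename] = []
        elif kind == "add_chunk":
            self.files[filename].append(handle)

    def mutate(self, op):
        self.log.append(op)   # log first, then change the in-memory state
        self.apply(op)

    def take_checkpoint(self):
        self.checkpoint = copy.deepcopy(self.files)
        self.log = []         # the log restarts after each checkpoint

    def recover(self):
        self.files = copy.deepcopy(self.checkpoint)
        for op in self.log:   # replay everything after the checkpoint
            self.apply(op)


store = MetadataStore()
store.mutate(("create", "/f", None))
store.mutate(("add_chunk", "/f", "c1"))
store.take_checkpoint()
store.mutate(("add_chunk", "/f", "c2"))

store.files = {}              # simulate a crash that wipes memory
store.recover()
print(store.files)
```

Checkpointing keeps the log short, so recovery only replays the operations made since the last snapshot.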
Chunk info
How does the master know what chunks are stored at
each chunkserver?
Master periodically sends a heartbeat to each
chunkserver.
Chunkserver responds with a list of all stored chunks and
their status.
Occasionally, master may have stale information.
Master does not have to persist chunk locations, which
simplifies it and reduces overhead.
Consistency
What does consistency mean?
All clients see the same data, whichever replica they read.
What does “defined” mean?
The region is consistent, and all clients see the complete
result of a mutation.
If a single mutation succeeds without interference from
concurrent writers, the region is consistent and defined.
Concurrent writes may be consistent but not defined.
Appends are handled more efficiently than random
writes.
Implications for applications
What implications does this model have for an
application?
Applications should append when possible
Applications need to keep track of the defined region of
the file.
Applications will need to tolerate or filter occasional
duplicate records.
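One way an application might cope is to give every record a unique ID and a checksum, then have readers skip anything that fails the checksum or repeats an ID. The sketch below assumes a made-up record layout (`id`, `payload`, `crc`); it is not a GFS API.

```python
import zlib


def make_record(record_id, payload):
    # Writer side: attach a unique ID and a checksum to each record.
    return {"id": record_id, "payload": payload, "crc": zlib.crc32(payload)}


def read_valid_records(raw_records):
    # Reader side: skip damaged records and duplicates left by retried appends.
    seen = set()
    for rec in raw_records:
        if zlib.crc32(rec["payload"]) != rec["crc"]:
            continue          # padding or a partial write
        if rec["id"] in seen:
            continue          # duplicate from a retried append
        seen.add(rec["id"])
        yield rec["payload"]


damaged = make_record(4, b"d")
damaged["payload"] = b"x"     # payload no longer matches its checksum

raw = [
    make_record(1, b"a"),
    make_record(2, b"b"),
    make_record(2, b"b"),     # the same append, retried
    damaged,
    make_record(3, b"c"),
]
clean = list(read_valid_records(raw))
print(clean)
```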
Leases
What is a lease? How is it used?
A lease is a time-limited grant of authority over mutations
to a chunk.
The master grants this to one replica (the primary),
which then picks the order of writes and coordinates them
with the other replicas.
Writing replicated data
What is the order of operations for writing replicated
data?
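Client asks master which replica holds the lease (the
primary) and where the other replicas are.
Client pushes data to all replicas; these cache it.
Client then sends the write request to the primary.
Primary assigns the write a serial number and forwards
the request to all replicas. All replicas process writes to
that chunk in the same order.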
What if a replica fails during this operation?
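The separation of data push from write ordering can be illustrated with a toy model. The `Replica` and `Primary` classes below are hypothetical and ignore failures, leases, and networking; they only show that the primary's serial numbers, not the arrival order of the data, decide the order in which every replica applies writes.

```python
class Replica:
    def __init__(self):
        self.buffer = {}  # data_id -> bytes pushed by a client, not yet applied
        self.chunk = []   # applied mutations as (serial, data), in order

    def push_data(self, data_id, data):
        self.buffer[data_id] = data

    def apply(self, serial, data_id):
        self.chunk.append((serial, self.buffer.pop(data_id)))


class Primary(Replica):
    def __init__(self, secondaries):
        super().__init__()
        self.secondaries = secondaries
        self.next_serial = 0

    def write(self, data_id):
        serial = self.next_serial          # the primary alone picks the order
        self.next_serial += 1
        self.apply(serial, data_id)
        for replica in self.secondaries:   # forward the same order to everyone
            replica.apply(serial, data_id)
        return serial


s1, s2 = Replica(), Replica()
primary = Primary([s1, s2])

# Two clients push their data to the replicas in different orders...
for replica in (primary, s1, s2):
    replica.push_data("x", b"X")
for replica in (s2, s1, primary):
    replica.push_data("y", b"Y")

# ...but the write requests reach the primary as y, then x.
primary.write("y")
primary.write("x")
print(primary.chunk, s1.chunk, s2.chunk)
```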
Data flow
Data is pushed linearly along a chain of replicas, each forwarding to the next.
This is an interesting choice; they could have used
multicast, or a tree.
Why is this?
Bigtable
Bigtable is implemented on top of GFS
What are the goals of Bigtable?
High availability, scalability, high performance
What does it not provide?
Complex relational queries, typed data (values are
uninterpreted strings)
Data model
What is Bigtable’s data model?
Sparse, sorted, multidimensional map: (row key, column
key, timestamp) maps to a data cell (string).
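As a rough picture of this map, here is a small in-memory sketch. The `Table` class is invented for illustration and is not the Bigtable client API; the row and column names are just sample values.

```python
class Table:
    def __init__(self):
        self.cells = {}  # (row, column) -> {timestamp: value}

    def put(self, row, column, timestamp, value):
        self.cells.setdefault((row, column), {})[timestamp] = value

    def get(self, row, column, timestamp=None):
        # Return the newest version at or before `timestamp` (newest overall
        # if no timestamp is given), or None if the cell is empty.
        versions = self.cells.get((row, column), {})
        candidates = [t for t in versions if timestamp is None or t <= timestamp]
        return versions[max(candidates)] if candidates else None


table = Table()
table.put("com.cnn.www", "contents:", 3, "<html>v1")
table.put("com.cnn.www", "contents:", 5, "<html>v2")

latest = table.get("com.cnn.www", "contents:")
older = table.get("com.cnn.www", "contents:", timestamp=4)
missing = table.get("com.cnn.www", "anchor:example")
print(latest, older, missing)
```

The map is sparse in the sense that a cell nobody wrote simply has no entry.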
Rows
Rows are kept in lexicographic order by row key and
broken into contiguous ranges called tablets.
What is the thinking behind this?
Column families
Column keys are grouped into column families.
What is the thinking behind this?
Data storage
GFS is used to store data.
Bigtable runs on shared machines alongside other applications.
Data files are written out using the SSTable file format.
Chubby is used to provide locking and synchronization.
Architecture
Master
Tablet servers
Clients
Chubby
Tablet servers
What do tablet servers do?
Handle interactions with clients, read and write data
Tablets are not replicated across tablet servers; each is
served by one server, and GFS replicates the underlying files.
How does a client find a tablet?
Location of the root tablet is stored in Chubby.
The root tablet points to metadata tablets, which map
tablets to tablet servers.
This info is then cached by the client.
Client communicates directly with the server.
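The lookup-then-cache behaviour might look like the sketch below. It collapses the location hierarchy into a single sorted table for brevity; the `METADATA` contents, server names, and `Client` class are assumptions made for the example, and row keys are assumed to be non-empty.

```python
import bisect

# Hypothetical metadata: each tablet covers row keys up to and including its
# end key; "\uffff" stands in for "end of table" on the last tablet.
METADATA = [("g", "ts1"), ("p", "ts2"), ("\uffff", "ts3")]
END_KEYS = [end for end, _ in METADATA]


class Client:
    def __init__(self):
        self.cache = []          # (start_exclusive, end_inclusive, server)
        self.metadata_reads = 0

    def find_server(self, row_key):
        # Cached tablet ranges are checked first.
        for start, end, server in self.cache:
            if start < row_key <= end:
                return server
        # Cache miss: consult the metadata and remember the answer.
        self.metadata_reads += 1
        i = bisect.bisect_left(END_KEYS, row_key)
        start = END_KEYS[i - 1] if i > 0 else ""
        end, server = METADATA[i]
        self.cache.append((start, end, server))
        return server


client = Client()
first = client.find_server("apple")    # metadata read
second = client.find_server("banana")  # same tablet, served from cache
third = client.find_server("house")    # different tablet, metadata read
print(first, second, third, client.metadata_reads)
```

Because rows are sorted, a binary search over tablet end keys is enough to find the tablet for any row.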
Master
What is the role of the master?
Keep track of tablet servers
Assign unassigned tablets to tablet servers.
Master
How can the master tell that a tablet server has died?
When a tablet server starts, it creates and locks a
uniquely named file in Chubby.
Master periodically asks each server for the status of its lock.
If the server has lost its lock or does not reply, master
attempts to acquire the lock itself.
If successful, it deletes the file and redistributes that
server’s tablets.
Discussion
How does Bigtable’s architecture compare to GFS?
What advantages does this structure have?
How does this compare to architectures such as CAN or
Chord that you might’ve learned about in 682?
MapReduce
What is the basic paradigm of MapReduce?
Define a map operation that is applied to each record in
an input to generate key/value pairs
Define a reduce operation applied to all elements with
the same key to aggregate results.
Example
The classic example, counting words:
    def map(document, text):
        # Emit (word, 1) for every word in the document's text.
        for word in text.split():
            yield word, 1

    def reduce(key, counts):
        # Sum all the counts emitted for this word.
        yield key, sum(counts)
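To see the two functions work together, they can be run on one machine with a small driver that does the grouping step in between. The driver `run_local` and the sample documents are made up for this sketch, and the functions are renamed `map_words` and `reduce_counts` so they do not shadow Python's built-in `map`.

```python
from collections import defaultdict


def map_words(document, text):
    for word in text.split():
        yield word, 1


def reduce_counts(key, counts):
    yield key, sum(counts)


def run_local(documents):
    # Map phase: collect every emitted value under its key (the "shuffle").
    groups = defaultdict(list)
    for name, text in documents.items():
        for key, value in map_words(name, text):
            groups[key].append(value)
    # Reduce phase: one call per distinct key.
    results = {}
    for key in sorted(groups):
        for out_key, out_value in reduce_counts(key, groups[key]):
            results[out_key] = out_value
    return results


word_counts = run_local({"a.txt": "the cat sat", "b.txt": "the dog sat down"})
print(word_counts)
```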
Parallelizing
Structuring your problem in this way allows the map
function to run simultaneously on many different
machines on subsets of your data.
Reduce can then run in parallel for each key.
Implementation
Input data is split into M pieces, one per map task.
Intermediate keyspace is subdivided into R partitions, one
per reduce task.
A master is used to assign tasks to workers.
Each mapping task is performed independently.
Results are buffered, written to the map worker's local
disk, and their locations returned to the master.
The master then forwards these locations to reduce
workers.
Reduce workers collect all data associated with their
keys, sort it by key, perform reduce, and write output to a file.
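Subdividing the keyspace is commonly done by hashing each key modulo the number of reduce tasks. The sketch below assumes that scheme and uses `zlib.crc32` as a stand-in hash, because Python's built-in `hash` for strings varies between runs; the function names are invented here.

```python
import zlib


def partition(key, num_reduce_tasks):
    # A stable hash so the same key always maps to the same reduce task.
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks


def shuffle(pairs, num_reduce_tasks):
    # Route every intermediate (key, value) pair to its reduce task's bucket.
    buckets = [[] for _ in range(num_reduce_tasks)]
    for key, value in pairs:
        buckets[partition(key, num_reduce_tasks)].append((key, value))
    return buckets


pairs = [("the", 1), ("cat", 1), ("the", 1), ("sat", 1)]
buckets = shuffle(pairs, 3)
for task_id, bucket in enumerate(buckets):
    print(task_id, bucket)
```

Whatever the hash, all pairs with the same key end up at the same reduce worker, which is what lets reduce see every value for a key.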
Failure
How is worker failure handled?
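The master pings every worker periodically.
Active tasks belonging to non-responsive workers are
reassigned.
Completed map tasks on a failed worker must be redone,
since their output sits on that worker's local disk.
Completed reduce tasks need not be; their output is
already in the global file system.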
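The master's bookkeeping on a worker failure can be sketched as a pass over its task table. The task dictionaries and state names below are assumptions for illustration, not the actual data structures.

```python
def handle_worker_failure(tasks, failed_worker):
    """Reset the tasks that must be rerun after `failed_worker` dies."""
    for task in tasks:
        if task["worker"] != failed_worker:
            continue
        in_progress = task["state"] == "in_progress"
        # Map output lives on the dead worker's local disk, so it is lost.
        lost_map_output = task["kind"] == "map" and task["state"] == "completed"
        if in_progress or lost_map_output:
            task["state"] = "idle"   # eligible for rescheduling
            task["worker"] = None


tasks = [
    {"kind": "map", "state": "completed", "worker": "w1"},
    {"kind": "map", "state": "in_progress", "worker": "w1"},
    {"kind": "reduce", "state": "completed", "worker": "w1"},
    {"kind": "reduce", "state": "in_progress", "worker": "w2"},
]
handle_worker_failure(tasks, "w1")
states = [task["state"] for task in tasks]
print(states)
```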
Failure
How is master failure handled?
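Checkpointing: the master periodically saves its state.
Restarting: a new master starts from the last checkpoint,
or the job is simply aborted and retried.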
Refinements
The authors describe a number of refinements to
MapReduce.
What are they and why are they useful?
User-defined partitioning
User-defined combining
Specialized readers
Skipping bad records
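As one example from this list, a combiner merges a map task's output locally before anything crosses the network. The sketch below assumes the word-count job from the earlier example; `map_words` and `combine` are illustrative names.

```python
from collections import Counter


def map_words(text):
    for word in text.split():
        yield word, 1


def combine(pairs):
    # Partially sum counts on the map side, so each word is sent once
    # per map task instead of once per occurrence.
    partial = Counter()
    for word, count in pairs:
        partial[word] += count
    return sorted(partial.items())


raw = list(map_words("to be or not to be"))
combined = combine(raw)
print(len(raw), combined)
```

This only works because addition is associative and commutative; a reduce function without those properties cannot safely be reused as a combiner.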
MR vs DBMS
Stonebraker et al. identify the sorts of tasks that
MapReduce (Hadoop) excels at, and those that RDBMSs
excel at.
MapReduce:
Extract-Transform-Load
Complex analytics that require multiple passes
Semi-structured data (key-value pairs)
Quick-and-dirty problems
Limited budget
MR vs DBMS
Parallel DBMS:
Grep
Log mining with GROUP BY
Join (combine user visits to URLs with PageRank
table)
MR vs DBMS
Stonebraker et al. suggest some reasons why DBMSs
might do better even on tasks that seem to be in
MapReduce’s area of expertise:
Repeated parsing of records
Tuned compression in DBMS
Intermediate data streamed, rather than written to
disk
Scheduling: DBMSs construct a query plan
Takeaway
Hadoop could incorporate streaming and more
job-aware scheduling
SQL is arguably easier to write than MapReduce code.
DBMSs need to be more plug-and-play
DBMSs should work with filesystem data.