Chubby lock service

The Chubby Lock Service for Loosely-coupled Distributed Systems
Mike Burrows, Google Inc.
Presented by Xin (Joyce) Zhan
Outline
• Design
– System structure
– Locks, caching, failovers
– Scaling mechanism
• Use and observations
– As name service
– Failover problems
Lock service for distributed systems
• Synchronize access to shared resources
• Other usage
– Primary election, meta-data storage, name service
• Reliability, availability
System Structure
• Set of replicas
• Periodically elected master
– Master lease
– Paxos protocol
• All client requests are directed to the master (master location is sketched below)
– updates propagated to replicas
• Replace failed replicas
– master periodically polls DNS
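
As a rough illustration of how a client might find the master before directing requests to it, here is a minimal sketch; the DNS helper, the RPC stub, and the replica names are hypothetical placeholders, not Chubby's actual protocol.

```python
# Sketch: a client locating the Chubby master by asking the replicas listed
# under the cell's DNS name. All names and stubs are illustrative only.

def resolve_replicas(cell_dns_name):
    # Stand-in for a DNS lookup returning the cell's replicas.
    return [f"{cell_dns_name}-replica-{i}" for i in range(5)]

def ask_which_master(replica):
    # Stand-in for an RPC any replica can answer, since replicas know
    # which replica currently holds the master lease.
    return "chubby-cell-replica-2"

def locate_master(cell_dns_name):
    for replica in resolve_replicas(cell_dns_name):
        master = ask_which_master(replica)
        if master:
            return master          # all subsequent requests go here
    raise RuntimeError("no master elected yet")

print(locate_master("chubby-cell"))
```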
Design
• Store small files
• Event notification mechanism
• Consistent caching
• Advisory locks (vs. mandatory) – usage sketched below
– conflict only when others attempt to acquire the same lock
• Coarse-grained locks
– survive lock server failures
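
A minimal sketch of the primary-election idiom these advisory, coarse-grained locks are typically used for: the winner writes its identity into the lock file, and cooperating clients respect the lock by convention. The in-memory FakeCell and all method names are illustrative stand-ins, not the real client library.

```python
# Sketch of advisory, coarse-grained lock use for primary election.
# FakeCell/FakeHandle are in-memory stand-ins for a Chubby cell.

class FakeHandle:
    def __init__(self, cell, path):
        self.cell, self.path = cell, path
    def try_acquire(self):
        # Advisory exclusive lock: conflicts only with other acquirers.
        if self.path in self.cell.locks:
            return False
        self.cell.locks.add(self.path)
        return True
    def set_contents(self, data):
        self.cell.files[self.path] = data
    def get_contents(self):
        return self.cell.files.get(self.path)

class FakeCell:
    def __init__(self):
        self.locks, self.files = set(), {}
    def open(self, path):
        return FakeHandle(self, path)

def try_become_primary(cell, my_address):
    handle = cell.open("/ls/cell/service/primary")
    if handle.try_acquire():            # won the election
        handle.set_contents(my_address) # advertise the primary's identity
        return handle                   # hold the coarse-grained lock
    return None                         # somebody else is primary

cell = FakeCell()
print(try_become_primary(cell, "server-a:9000") is not None)  # True
print(try_become_primary(cell, "server-b:9000") is not None)  # False
print(cell.open("/ls/cell/service/primary").get_contents())   # server-a:9000
```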
Design - File Interface
• Ease distribution
– /ls/foo/wombat/pouch
• Node meta-data includes Access Control Lists (ACLs)
• Handle
– analogous to UNIX file descriptors
– support for use across master changes
Design - Sequencer for lock
• Delayed / out-of-order messages
– introduce sequence numbers into interactions that use locks
– lock holder requests a sequencer and passes it to the file server to validate (sketched below)
• Alternative
– lock-delay
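
A sketch of how a file server might check the sequencer it receives with a request. The sequencer's fields (lock name, mode, generation number) follow the paper's description; the validation cache below is an illustrative assumption.

```python
# Sketch of sequencer checking at a server that receives requests made
# under a Chubby lock. The cache and logic are illustrative only.

from dataclasses import dataclass

@dataclass(frozen=True)
class Sequencer:
    lock_name: str
    mode: str          # "exclusive" or "shared"
    generation: int    # bumped each time the lock goes from free to held

class FileServer:
    def __init__(self):
        # Latest generation observed per lock; a real server would consult
        # Chubby when it sees an unknown or newer sequencer.
        self.known_generation = {}

    def validate(self, seq: Sequencer) -> bool:
        current = self.known_generation.get(seq.lock_name, seq.generation)
        self.known_generation[seq.lock_name] = max(current, seq.generation)
        # Reject delayed requests issued under an older lock generation.
        return seq.generation >= self.known_generation[seq.lock_name]

fs = FileServer()
new = Sequencer("/ls/cell/service/lock", "exclusive", generation=8)
old = Sequencer("/ls/cell/service/lock", "exclusive", generation=7)
print(fs.validate(new))  # True: current holder
print(fs.validate(old))  # False: stale, out-of-order request is dropped
```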
Design - Events
• Client subscribes when creating handle
• Delivered asynchronously via up-calls from the client library (see the sketch below)
• Event types
– file contents modified
– child node added / removed / modified
– Chubby master failed over
– handle / lock has become invalid
– lock acquired / conflicting lock request (rarely used)
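
A sketch of this subscription model under a hypothetical handle stub: the client states its interest when the handle is created and later receives asynchronous up-calls from the library.

```python
# Sketch of Chubby-style event subscription. The dispatcher below is an
# illustrative stub, not the real client library.

from enum import Enum, auto

class Event(Enum):
    CONTENTS_MODIFIED = auto()
    CHILD_CHANGED = auto()
    MASTER_FAILED_OVER = auto()
    HANDLE_INVALID = auto()
    LOCK_ACQUIRED = auto()          # rarely used in practice

class HandleStub:
    def __init__(self, path, events, callback):
        self.path, self.events, self.callback = path, set(events), callback
    def deliver(self, event):
        # In the real system this up-call arrives asynchronously, after
        # the corresponding action has already taken place.
        if event in self.events:
            self.callback(self.path, event)

def on_event(path, event):
    print(f"{path}: {event.name}")

h = HandleStub("/ls/cell/service/config",
               [Event.CONTENTS_MODIFIED, Event.MASTER_FAILED_OVER],
               on_event)
h.deliver(Event.CONTENTS_MODIFIED)   # delivered
h.deliver(Event.CHILD_CHANGED)       # ignored: not subscribed
```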
Design - Caching
• Clients cache file data and meta data
– Consistent, write-through
• Invalidation (sketched below)
– master keeps a list of what clients may have cached
– master sends invalidations piggybacked on KeepAlive replies
– clients flush changed data and acknowledge with their next KeepAlive
– master proceeds with the modification only after every invalidation is acknowledged (or the client's lease expires)
• Clients cache open handle and locks
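
A sketch of the master-side invalidation flow described above. The data structures are illustrative; in the real system acknowledgements arrive on the clients' next KeepAlives, or the client simply lets its cache lease expire.

```python
# Sketch of write-through, invalidation-based caching on the master side.

class CacheMaster:
    def __init__(self):
        self.contents = {}
        self.may_cache = {}      # node path -> set of session ids

    def read(self, session, path):
        # Remember which sessions may now hold a cached copy.
        self.may_cache.setdefault(path, set()).add(session)
        return self.contents.get(path)

    def write(self, path, data, ack_invalidation):
        # 1. Tell every session that may hold a cached copy to drop it
        #    (piggybacked on KeepAlive replies in the real system).
        pending = self.may_cache.pop(path, set())
        # 2. Proceed with the modification only after all of them have
        #    acknowledged (or their cache leases have expired).
        for session in pending:
            ack_invalidation(session, path)
        self.contents[path] = data

master = CacheMaster()
master.read("session-1", "/ls/cell/app/config")
master.write("/ls/cell/app/config", b"v2",
             ack_invalidation=lambda s, p: print(f"invalidate {p} at {s}"))
```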
Design - Sessions
• Session maintained through KeepAlives
– handles, locks, cached data remain valid
– lease
• Lease timeout advanced when
– creation of a session
– master fail-over occurs
– master responds to KeepAlive RPC
Design - KeepAlive
• Master responds close to lease timeout
• Client sends another KeepAlive immediately
• Client maintains local lease timeout
– conservative approximation
• When local lease expires
– disable cache
– session in jeopardy, client waits in grace period
– cache enabled on reconnect
• Application informed about session state changes
– jeopardy / safe / expired events (see the sketch below)
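
A simplified, illustrative state machine for the client library's local lease handling; the 45 s grace period matches the paper's description, while the other timings and the control flow are assumptions made for the sketch.

```python
# Sketch of client-side lease handling: conservative local timeout,
# jeopardy + grace period, cache disabled until reconnection.

import time

GRACE_PERIOD_S = 45.0                      # grace period from the paper

class ClientSession:
    def __init__(self):
        # Conservative local approximation of the master's lease grant.
        self.local_lease_deadline = time.monotonic() + 12.0
        self.cache_enabled = True

    def on_keepalive_reply(self, lease_seconds):
        # Master answers close to the old timeout; the client immediately
        # sends the next KeepAlive and extends its local deadline.
        self.local_lease_deadline = time.monotonic() + lease_seconds
        if not self.cache_enabled:
            self.cache_enabled = True       # cache re-enabled on reconnect
            notify_application("safe")

    def tick(self):
        if time.monotonic() < self.local_lease_deadline:
            return
        # Local lease expired: disable the cache and enter jeopardy.
        self.cache_enabled = False
        notify_application("jeopardy")
        if not self.wait_for_reconnect(GRACE_PERIOD_S):
            notify_application("expired")   # session really is gone

    def wait_for_reconnect(self, timeout_s):
        # Stub: a real client blocks on its in-flight KeepAlive here.
        return False

def notify_application(event):
    print("session event:", event)

s = ClientSession()
s.local_lease_deadline = time.monotonic()   # force expiry for the demo
s.tick()                                    # -> jeopardy, then expired
```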
Design - Failovers
• In-memory state discarded
– sessions, handles, locks, etc.
• Lease timer “stops”
• Fast master election
– clients reconnect before their leases expire
• Slow master election
– clients flush caches and enter the grace period
• New master reconstructs an approximation of the previous master's in-memory state
Design - Failovers
Steps of a newly-elected master (condensed into code below):
• Pick a new epoch number
• Respond only to master-location requests
• Build in-memory state for sessions / locks from the database
• Respond to KeepAlives
• Emit fail-over events to sessions; clients flush caches
• Wait for acknowledgements / session expiry
• Allow all operations to proceed
• Allow clients to use handles created before the fail-over
• Delete ephemeral files without open handles after an interval
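
The same steps condensed into one ordered routine, purely to illustrate their ordering; every helper below is a print-only placeholder, not real master logic.

```python
# Sketch of a newly-elected master's start-up sequence (placeholders only).

def step(msg):
    print(msg)

def start_new_master(db, old_epoch):
    epoch = old_epoch + 1                           # new epoch number
    step(f"epoch {epoch}: answer only master-location requests")
    sessions = db.get("sessions", [])               # rebuild sessions/locks,
    locks = db.get("locks", [])                     # but not handles
    step(f"rebuilt {len(sessions)} sessions, {len(locks)} locks from database")
    step("answering KeepAlives again")
    step("emitting fail-over event; clients flush caches")
    step("waiting for acknowledgements or session expiry")
    step("allowing all operations to proceed")
    step("recreating pre-fail-over handles on first use")
    step("scheduling deletion of ephemeral files without open handles")

start_new_master({"sessions": ["s1", "s2"], "locks": ["l1"]}, old_epoch=6)
```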
Design - Backup and Mirroring
• Master writes snapshots every few hours
– GFS server in different building
• Collection of files mirrored across cells
– /ls/global/master mirrored to /ls/cell/slave
• Mostly for configuration files
– Chubby’s own ACLs
– Files advertising presence / location
– pointers to Bigtable cells
Design - Scaling Mechanisms
• 90,000 clients communicate with one cell
• Regulate the number of Chubby cells
– clients use a nearby cell
• Increase lease time
• Client caching
• Protocol-conversion servers
Scaling - Proxies
• Proxies pass requests from clients to cell
• Reduce traffic of KeepAlives and read requests
– not writes, but writes are << 1% of the workload
– KeepAlive traffic is by far the most dominant (see the estimate below)
• Overheads:
– additional RPC for writes / first-time reads
– increased probability of unavailability
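
A back-of-the-envelope estimate of why proxies help with the KeepAlive-dominated load: the proxy keeps a single session with the master on behalf of many clients, so the master sees roughly fan-in times fewer KeepAlives. The client count echoes the earlier scaling slide; the KeepAlive interval and proxy fan-in are assumed figures, not measurements.

```python
# Illustrative estimate of KeepAlive load at the master with and without
# proxies. All numbers are assumptions chosen for the sketch.

clients = 90_000           # clients in a busy cell (order of magnitude)
keepalive_period_s = 12    # assumed KeepAlive interval per session
fan_in = 1_000             # assumed clients per proxy

direct_rps = clients / keepalive_period_s
proxied_rps = (clients / fan_in) / keepalive_period_s
print(f"KeepAlives at master: {direct_rps:.0f}/s direct "
      f"vs {proxied_rps:.1f}/s via proxies")
```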
Scaling - Partitioning
• Namespace of a cell partitioned between servers
• N partitions, each with a master and replicas
– node D/C stored on partition P(D/C) = hash(D) mod N (sketched below)
– meta-data for D may be on a different partition
• Little cross-partition communication
• Reduces read/write traffic, but not necessarily KeepAlive traffic
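
A sketch of the partitioning rule on this slide, using CRC32 as a stand-in hash: a node D/C is assigned to partition hash(D) mod N, so all children of a directory land on the same partition, while the directory's own meta-data (keyed by its parent) may live elsewhere.

```python
# Sketch: assigning nodes to partitions by hashing the parent directory.
# CRC32 is only a stand-in for whatever hash a real cell would use.

import zlib

def partition_of(path: str, num_partitions: int) -> int:
    parent = path.rsplit("/", 1)[0]                      # D for a node D/C
    return zlib.crc32(parent.encode()) % num_partitions  # hash(D) mod N

for p in ["/ls/cell/app/config", "/ls/cell/app/lock", "/ls/cell/other/x"]:
    print(p, "->", partition_of(p, 4))
```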
Use and Observations
• Many files used for naming
• Config, ACL, and meta-data files are common
• ~10 clients use each cached file, on average
• Few locks held; no shared locks
• KeepAlives dominate RPC traffic
Use as Name Service
• DNS relies on TTL values
– entries must be refreshed within that time
– huge (and variable) load on DNS servers (see the estimate below)
• Chubby's caching uses invalidations, not polling
– client builds up the entries it needs in its cache
– name entries are further grouped into batches
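
A worked version of the load argument, following the paper's illustrative example of a 3,000-process job whose tasks all resolve one another under a 60 s DNS TTL; these figures are for illustration, not a measurement of any particular system.

```python
# With TTL-based caching, every cached entry must be refreshed within the
# TTL, so lookup load grows with (processes x names) / TTL.

processes = 3_000
names_per_process = 3_000   # each task talks to every other task
ttl_seconds = 60

dns_lookups_per_second = processes * names_per_process / ttl_seconds
print(f"{dns_lookups_per_second:,.0f} lookups/s just to keep caches warm")
# With Chubby's invalidation-based caching, no refresh traffic is needed
# while the name entries stay unchanged.
```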
Failover problems
• Master writes sessions to the database when they are created
– database overloaded when many processes start at once
• Instead, store the session at its first modification / lock acquisition, etc.
• Active sessions recorded with some probability on each KeepAlive
– spreads writes out in time
– a young read-only session may be discarded in a fail-over
Failover problems
• New design: do not record sessions in the database
– recreate them, like handles, after a fail-over
– new master waits a full lease time before letting operations proceed
Lessons learnt
• Developers rarely consider availability
– should plan for short Chubby outages
• Fine-grained locking not essential
• Poor API choices
– handles acquiring locks cannot be shared
• RPC use affects transport protocols
– forced to send KeepAlives by UDP for timeliness
Q&A