The Chubby Lock Service for Loosely-coupled Distributed Systems
Mike Burrows, Google Inc.
Presented by Xin (Joyce) Zhan

Outline
• Design
– System structure
– Locks, caching, failovers
– Scaling mechanisms
• Use and observations
– As a name service
– Failover problems

Lock Service for Distributed Systems
• Synchronize access to shared resources
• Other uses
– Primary election, meta-data storage, name service
• Reliability, availability

System Structure
• Set of replicas
• Periodically elected master
– Master lease
– Paxos protocol
• All client requests are directed to the master
– updates are propagated to the replicas
• Replace failed replicas
– master periodically polls DNS

Design
• Stores small files
• Event notification mechanism
• Consistent caching
• Advisory locks (vs. mandatory)
– conflict only when others attempt to acquire the same lock
• Coarse-grained locks
– survive lock server failures

Design - File Interface
• Eases distribution
– /ls/foo/wombat/pouch
• Node meta-data includes Access Control Lists
• Handles
– analogous to UNIX file descriptors
– supported across master changes

Design - Sequencer for Locks
• Delayed / out-of-order messages
– introduce sequence numbers into interactions that use locks
– lock holder requests a sequencer, passes it to the file server to validate (see the sketch after the KeepAlive slide)
• Alternative: lock-delay

Design - Events
• Client subscribes when creating a handle
• Delivered asynchronously via up-calls from the client library
• Event types
– file contents modified
– child node added / removed / modified
– Chubby master failed over
– handle / lock has become invalid
– lock acquired / conflicting lock request (rarely used)

Design - Caching
• Clients cache file data and meta-data
– consistent, write-through
• Invalidation (see the sketch after the KeepAlive slide)
– master keeps a list of what clients may have cached
– master sends invalidations on top of KeepAlive replies
– clients flush changed data, acknowledge with KeepAlive
– server proceeds with the modification only after the invalidations are acknowledged
• Clients also cache open handles and locks

Design - Sessions
• Session maintained through KeepAlives
– handles, locks, and cached data remain valid
– lease
• Lease timeout advanced on
– creation of a session
– a master fail-over
– the master responding to a KeepAlive RPC

Design - KeepAlive
• Master responds close to the lease timeout
• Client sends another KeepAlive immediately
• Client maintains a local lease timeout
– a conservative approximation of the master's
• When the local lease expires
– disable the cache
– session in jeopardy; client waits in the grace period
– cache re-enabled on reconnect
• Application informed about session changes
– jeopardy / safe / expired events (client-side sketch below)
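Sketch - Sequencer Validation
The sequencer mechanism can be made concrete with a minimal Go sketch. Everything here is illustrative: Sequencer, lockServer, Acquire, and Valid are hypothetical names, not Chubby's API. The point is only that a token minted at acquisition time is checked against the lock's current generation, so delayed or out-of-order requests from a stale holder are rejected.

    package main

    import "fmt"

    // Sequencer is an opaque token describing the state of a lock at the
    // moment it was acquired (hypothetical layout; the real token is opaque bytes).
    type Sequencer struct {
    	LockName   string
    	Exclusive  bool
    	Generation uint64 // incremented each time the lock is acquired
    }

    // lockServer tracks the current generation of each lock.
    type lockServer struct {
    	generations map[string]uint64
    }

    // Acquire grants the lock and hands back a sequencer for this acquisition.
    func (s *lockServer) Acquire(name string) Sequencer {
    	s.generations[name]++
    	return Sequencer{LockName: name, Exclusive: true, Generation: s.generations[name]}
    }

    // Valid is what a file server would ask (directly or via the lock
    // service) before acting on a request that carries a sequencer.
    func (s *lockServer) Valid(seq Sequencer) bool {
    	return s.generations[seq.LockName] == seq.Generation
    }

    func main() {
    	ls := &lockServer{generations: make(map[string]uint64)}

    	seq := ls.Acquire("/ls/foo/wombat/pouch") // client becomes lock holder
    	fmt.Println(ls.Valid(seq))                // true: request carries a fresh sequencer

    	ls.Acquire("/ls/foo/wombat/pouch") // lock lost and re-acquired elsewhere
    	fmt.Println(ls.Valid(seq))         // false: the delayed message is rejected
    }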
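Sketch - Write-through Invalidation
The invalidation protocol on the Caching slide can be sketched the same way. cacheTracker and its methods are hypothetical; in real Chubby the invalidations and acknowledgements ride on KeepAlive traffic rather than a callback, so treat the ack function as a stand-in.

    package main

    import "fmt"

    // cacheTracker sketches the master-side bookkeeping from the Caching
    // slide: for each file, the set of sessions that may hold it in cache.
    type cacheTracker struct {
    	mayCache map[string]map[int]bool // file -> set of session IDs
    }

    func (t *cacheTracker) recordRead(file string, session int) {
    	if t.mayCache[file] == nil {
    		t.mayCache[file] = make(map[int]bool)
    	}
    	t.mayCache[file][session] = true
    }

    // write blocks the modification until every client that may cache the
    // file has acknowledged the invalidation.
    func (t *cacheTracker) write(file string, ack func(session int) bool) {
    	for s := range t.mayCache[file] {
    		for !ack(s) { // real code would wait on the session's KeepAlive channel
    		}
    		delete(t.mayCache[file], s)
    	}
    	fmt.Println("all invalidations acknowledged; applying write to", file)
    }

    func main() {
    	t := &cacheTracker{mayCache: make(map[string]map[int]bool)}
    	t.recordRead("/ls/foo/wombat/pouch", 1)
    	t.recordRead("/ls/foo/wombat/pouch", 2)
    	t.write("/ls/foo/wombat/pouch", func(s int) bool { return true })
    }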
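Sketch - Client-side Lease Handling
Finally, a sketch of the local lease logic from the KeepAlive slide, under heavy assumptions: keepAlive simulates the blocking RPC (the real master parks the call and answers close to lease expiry), and the failure on the third call is staged so the jeopardy / safe path runs. None of these names are Chubby's client API.

    package main

    import (
    	"errors"
    	"fmt"
    	"time"
    )

    var errTimeout = errors.New("keepalive timed out")

    // keepAlive stands in for the blocking KeepAlive RPC; the third call
    // simulates a master fail-over in which no reply arrives in time.
    var calls int

    func keepAlive() (time.Time, error) {
    	calls++
    	if calls == 3 {
    		return time.Time{}, errTimeout // no reply before the local lease expired
    	}
    	return time.Now().Add(12 * time.Second), nil
    }

    func sessionLoop() {
    	localLease := time.Now().Add(12 * time.Second) // conservative local approximation

    	for i := 0; i < 5; i++ {
    		lease, err := keepAlive()
    		if err == nil {
    			localLease = lease // lease extended; send the next KeepAlive immediately
    			continue
    		}

    		// Local lease expired with no reply: disable the cache, report
    		// jeopardy, and keep trying until the grace period runs out.
    		fmt.Println("event: jeopardy (cache disabled)")
    		if lease, err = keepAlive(); err == nil {
    			localLease = lease
    			fmt.Println("event: safe (cache flushed and re-enabled)")
    		} else {
    			fmt.Println("event: expired") // session lost; application must recover
    			return
    		}
    	}
    	fmt.Println("session healthy; local lease until", localLease.Format(time.Kitchen))
    }

    func main() { sessionLoop() }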
Design - Failovers
• In-memory state discarded
– sessions, handles, locks, etc.
• Lease timer "stops"
• Fast master election
– clients reconnect before their leases expire
• Slow master election
– clients flush their caches and enter the grace period
• New master reconstructs an approximation of the previous master's in-memory state

Design - Failovers
Steps of a newly-elected master:
• Pick a new epoch number (see the sketch after the Q&A slide)
• Respond only to master-location requests
• Build in-memory state for sessions / locks from the database
• Respond to KeepAlives
• Emit fail-over events to sessions, flushing caches
• Wait for acknowledgements / session expiry
• Allow all operations to proceed
• Allow clients to use handles created before the fail-over
• Delete ephemeral files without open handles after an interval

Design - Backup and Mirroring
• Master writes snapshots every few hours
– to a GFS server in a different building
• Collections of files mirrored across cells
– /ls/global/master mirrored to /ls/cell/slave
• Mostly for configuration files
– Chubby's own ACLs
– files advertising presence / location
– pointers to Bigtable cells

Design - Scaling Mechanisms
• 90,000 clients communicate with one cell
• Regulate the number of Chubby cells
– clients use a nearby cell
• Increase the lease time
• Client caching
• Protocol-conversion servers

Scaling - Proxies
• Proxies pass requests from clients to a cell
• Reduce KeepAlive and read traffic
– not writes, but writes are << 1% of the workload
– KeepAlive traffic is by far the most dominant
• Overheads:
– an additional RPC for writes / first-time reads
– increased probability of unavailability

Scaling - Partitioning
• The namespace of a cell is partitioned between servers
• N partitions, each with a master and replicas
– node D/C is stored on partition P(D/C) = hash(D) mod N (see the sketch after the Q&A slide)
– meta-data for D may be on a different partition
• Little cross-partition communication
• Reduces read/write traffic, but not necessarily KeepAlive traffic

Use and Observations
• Many files are used for naming
• Config, ACL, and meta-data files are common
• 10 clients use each cached file, on average
• Few locks are held; no shared locks
• KeepAlives dominate RPC traffic

Use as a Name Service
• DNS uses TTL values
– entries must be refreshed within that time
– huge (and variable) load on DNS servers
• Chubby's caching uses invalidations, so no polling
– the client builds up the entries it needs in its cache
– name entries are further grouped into batches

Failover Problems
• Master writes sessions to the database when they are created
– overload when many processes start at once
• Instead, store a session at its first modification / lock acquisition, etc.
• Active sessions recorded with some probability on KeepAlive
– spreads the writes out in time
– a young read-only session may be discarded in a failover

Failover Problems
• New design
– do not record sessions in the database
– recreate them, like handles, after a fail-over
– new master waits a full lease time before letting operations proceed

Lessons Learnt
• Developers rarely consider availability
– should plan for short Chubby outages
• Fine-grained locking is not essential
• Poor API choices
– handles acquiring locks cannot be shared
• RPC use affects transport protocols
– forced to send KeepAlives by UDP for timeliness

Q&A
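Sketch - Fail-over Epoch Numbers
A sketch of the epoch-number step from the fail-over slide. The master type and its methods are hypothetical; the idea is only that a request stamped with an earlier master's epoch is detectably stale and can be rejected.

    package main

    import (
    	"errors"
    	"fmt"
    )

    // master holds a fragment of the state a newly elected master rebuilds.
    type master struct {
    	epoch uint64
    }

    var errStaleEpoch = errors.New("request carries an old epoch number")

    // newMaster picks a new epoch number, the first step on the slide, so
    // requests addressed to a previous master are distinguishable.
    func newMaster(lastEpoch uint64) *master {
    	return &master{epoch: lastEpoch + 1}
    }

    // handle checks the epoch before processing any client request.
    func (m *master) handle(reqEpoch uint64) error {
    	if reqEpoch != m.epoch {
    		return errStaleEpoch // client must rediscover the master and retry
    	}
    	return nil
    }

    func main() {
    	old := newMaster(6)       // epoch 7
    	m := newMaster(old.epoch) // fail-over: epoch 8

    	fmt.Println(m.handle(7)) // request from before the fail-over: rejected
    	fmt.Println(m.handle(8)) // current-epoch request: nil error, processed
    }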
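Sketch - Partition Function
The partition function from the Scaling - Partitioning slide, P(D/C) = hash(D) mod N. Hashing the parent directory D keeps all of a directory's children on one partition; the FNV hash is our choice here, since the paper does not name one.

    package main

    import (
    	"fmt"
    	"hash/fnv"
    	"path"
    )

    // partition implements P(D/C) = hash(D) mod N: a node is placed by
    // hashing the name of its parent directory D.
    func partition(node string, n uint32) uint32 {
    	dir := path.Dir(node) // D for a node written D/C
    	h := fnv.New32a()
    	h.Write([]byte(dir))
    	return h.Sum32() % n
    }

    func main() {
    	const n = 4
    	fmt.Println(partition("/ls/foo/wombat/pouch", n)) // hash("/ls/foo/wombat") mod 4
    	fmt.Println(partition("/ls/foo/wombat/tail", n))  // same parent, same partition
    	fmt.Println(partition("/ls/foo/other/pouch", n))  // different parent, may differ
    }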