Large-scale cluster management at Google with Borg By: Abhishek Verma, Luis Pedrosa, Madhukar Korupolu, David Oppenheimer, Eric Tune, John Wilkes Presented by: Xianghan Pei EECS 582 – W16 1 Outline • Introduction & Motivation • The User Perspective • Borg Architecture • Scalability & Availability • Utilization • Isolation • Lessons • Future and Related Work EECS 582 – W16 2 Introduction & Motivation • Introduction • Borg: An OS of Cluster(Datacenter) • Motivation • Hide the details(programmer focus on App) • Provide resource sharing The high-level architecture of Borg • Provide high reliability and availability for Cluster EECS 582 – W16 3 The User Perspective • Workload • production(prod) VS non-production(non-prod) • Cells • Heterogeneous, around 10 k machines in median • Jobs and Tasks > borgcfg .../hello_world_webserver.borg up EECS 582 – W16 4 The User Perspective • Allocs • Reserved set of resources • Priority, Quota, and Admission Control • Job has a priority (preempting) • Quota is used to decide which jobs to admit for scheduling • Naming and Monitoring • 50.jfoo.ubar.cc.borg.google.com • Monitoring health of the task and thousands of performance metrics EECS 582 – W16 5 Borg Architecture • Borgmaster • Main Borgmaster process & Scheduler • Five replicas • Borglet • Manage and monitor tasks and resource • Borgmaster polls Borglet every few seconds The high-level architecture of Borg(A Cell) EECS 582 – W16 6 Borg Architecture • Scheduling • feasibility checking: find machines • Scoring: pick one machines • User preferences & build-in criteria • E-PVM VS best-fit • Tradeoff The high-level architecture of Borg(A Cell) EECS 582 – W16 7 Scalability • Separate scheduler • Separate threads to poll the Borglets • Partition functions across the five replicas • Score caching • Equivalence classes • Relaxed randomization The high-level architecture of Borg(A Cell) EECS 582 – W16 8 Availability • System • Borgmaster: Five replicas • Borglet: Borgmaster • Take checkpoint • Job • Reschedules evicted tasks • Spreading tasks across failure domian • … EECS 582 – W16 9 Utilization • Cell Sharing • Segregating prod and non-prod work into different cells would need more machines EECS 582 – W16 10 Utilization • Cell Size • Subdividing cells into smaller ones would require more machines EECS 582 – W16 11 Utilization • Fine-grained resource requests No bucket sizes fit most of the tasks well Bucketing resource would need more machines EECS 582 – W16 12 Utilization • Resource reclamation • estimate how many resources a task will use and reclaim the rest for work • Kill non-prod if not available EECS 582 – W16 13 Utilization • Resource reclamation • Choose medium EECS 582 – W16 14 Isolation • Security isolation • chroot jail as the primary security isolation mechanism • Performance isolation • High-priority LV task(prod) get best treatment • compressible resources: reclaimed • non-compressible resources: kill or remove task EECS 582 – W16 15 Lessons • Bad • Jobs are restrictive as the only grouping mechanism for tasks • One IP address per machine complicates things • Optimizing for power users at the expense of casual ones • Good • • • • Allocs are useful Cluster management is more than task management Introspection is vital The master is the kernel of a distributed system EECS 582 – W16 16 Future Work • Kubernetes • Evolved from Borg • An open-source system for automating deployment, operations, and scaling of containerized applications • Pods: groups of containers • Labels • Replica controller • Services EECS 582 – W16 17 Relative Work • Apache Mesos • YARN • Tupperware EECS 582 – W16 18 Questions? EECS 582 – W16 19 References • Borg Paper • Borg Slides at InfoQ EECS 582 – W16 20
© Copyright 2026 Paperzz