Slides

Large-scale cluster
management at Google with
Borg
By: Abhishek Verma, Luis Pedrosa, Madhukar Korupolu, David
Oppenheimer, Eric Tune, John Wilkes
Presented by: Xianghan Pei
EECS 582 – W16
1
Outline
• Introduction & Motivation
• The User Perspective
• Borg Architecture
• Scalability & Availability
• Utilization
• Isolation
• Lessons
• Future and Related Work
EECS 582 – W16
2
Introduction & Motivation
• Introduction
• Borg: An OS of Cluster(Datacenter)
• Motivation
• Hide the details(programmer focus on App)
• Provide resource sharing
The high-level architecture of Borg
• Provide high reliability and availability for Cluster
EECS 582 – W16
3
The User Perspective
• Workload
• production(prod) VS non-production(non-prod)
• Cells
• Heterogeneous, around 10 k machines in median
• Jobs and Tasks
> borgcfg .../hello_world_webserver.borg up
EECS 582 – W16
4
The User Perspective
• Allocs
• Reserved set of resources
• Priority, Quota, and Admission Control
• Job has a priority (preempting)
• Quota is used to decide which jobs to admit for scheduling
• Naming and Monitoring
• 50.jfoo.ubar.cc.borg.google.com
• Monitoring health of the task and thousands of performance metrics
EECS 582 – W16
5
Borg Architecture
• Borgmaster
• Main Borgmaster process & Scheduler
• Five replicas
• Borglet
• Manage and monitor tasks and resource
• Borgmaster polls Borglet every few seconds
The high-level architecture of Borg(A Cell)
EECS 582 – W16
6
Borg Architecture
• Scheduling
• feasibility checking: find machines
• Scoring: pick one machines
• User preferences & build-in criteria
• E-PVM VS best-fit
• Tradeoff
The high-level architecture of Borg(A Cell)
EECS 582 – W16
7
Scalability
• Separate scheduler
• Separate threads to poll the Borglets
• Partition functions across the five replicas
• Score caching
• Equivalence classes
• Relaxed randomization
The high-level architecture of Borg(A Cell)
EECS 582 – W16
8
Availability
• System
• Borgmaster: Five replicas
• Borglet: Borgmaster
• Take checkpoint
• Job
• Reschedules evicted tasks
• Spreading tasks across failure domian
• …
EECS 582 – W16
9
Utilization
• Cell Sharing
• Segregating prod and non-prod work into different cells would need
more machines
EECS 582 – W16
10
Utilization
• Cell Size
• Subdividing cells into smaller ones would require more machines
EECS 582 – W16
11
Utilization
• Fine-grained resource requests
No bucket sizes fit most of the tasks well Bucketing resource would need more machines
EECS 582 – W16
12
Utilization
• Resource reclamation
• estimate how many
resources a task will use
and reclaim the rest for
work
• Kill non-prod if not
available
EECS 582 – W16
13
Utilization
• Resource reclamation
• Choose medium
EECS 582 – W16
14
Isolation
• Security isolation
• chroot jail as the primary security isolation mechanism
• Performance isolation
• High-priority LV task(prod) get best treatment
• compressible resources: reclaimed
• non-compressible resources: kill or remove task
EECS 582 – W16
15
Lessons
• Bad
• Jobs are restrictive as the only grouping mechanism for tasks
• One IP address per machine complicates things
• Optimizing for power users at the expense of casual ones
• Good
•
•
•
•
Allocs are useful
Cluster management is more than task management
Introspection is vital
The master is the kernel of a distributed system
EECS 582 – W16
16
Future Work
• Kubernetes
• Evolved from Borg
• An open-source system for automating deployment, operations, and
scaling of containerized applications
• Pods: groups of containers
• Labels
• Replica controller
• Services
EECS 582 – W16
17
Relative Work
• Apache Mesos
• YARN
• Tupperware
EECS 582 – W16
18
Questions?
EECS 582 – W16
19
References
• Borg Paper
• Borg Slides at InfoQ
EECS 582 – W16
20