Firmament

Firmament: Fast, Centralized Cluster
Scheduling at Scale
Ionel Gog, University of Cambridge; Malte Schwarzkopf, MIT CSAIL; Adam Gleave and Robert
N. M. Watson, University of Cambridge; Steven Hand, Google, Inc
Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI '16), pages 99-115
Savannah, GA, USA, November 2-4, 2016
USENIX Association, Berkeley, CA, USA, 2016
ISBN: 978-1-931971-33-1
Scheduler
A cluster scheduler decides how to place the tasks of submitted jobs on cluster
machines, where they are instantiated as processes, containers, or VMs.
Goals
• Low placement latency
• High placement quality
A good Scheduler
Better task placements by the cluster scheduler lead to:
◦ Higher machine utilization
◦ Shorter batch job runtimes
◦ Improved load balancing
◦ More predictable application performance
◦ Increased fault tolerance
Types of Schedulers
Centralized
◦ High-quality placement, tends to have high latency
Distributed
◦ Low latency, tends to have lower quality placement
Hybrid
◦ Splits work between centralized and distributed components; the distributed
component places short tasks
Firmament
Centralized Scheduler
Based on Quincy
o Experiments show a 20× improvement in placement latency over Quincy, while
maintaining nearly the same placement quality
Flow Based Scheduler
o Uses Min-Cost Max-Flow Algorithm
Key Insight
o Solve the MCMF problem incrementally and apply problem-specific
optimizations
Goals for Firmament
• to maintain the same high placement quality as an existing
centralized scheduler (viz. Quincy)
• to achieve sub-second task placement latency for all workloads in
the common case
• to cope well with demanding situations such as cluster
oversubscription or large incoming jobs.
Flow Based Scheduler
Task-by-Task Placement
◦ commits to a placement early and restricts the choices available to tasks that are still waiting
Batch Placement
◦ best trade-off for the whole batch
Flow Based Scheduling
◦ uses min-cost max-flow (MCMF) optimization
◦ MCMF guarantees optimal task placements for a given scheduling policy
Flow Network
• Directed graph whose arcs carry flow from source nodes to a sink node
• All such flow must drain into the sink node
• Capacities associated with each arc constrain the flow
• Costs associated with each arc specify preferential routes for the flow
• The structure of the network is defined by the scheduling policy, for example:
◦ Load spreading
◦ Quincy policy
◦ Network-aware policy
(A minimal example network is sketched below.)
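To make the structure concrete, here is a minimal sketch of such a scheduling flow network, built with networkx rather than Firmament's own solver. The task and machine names, slot counts, and costs are illustrative assumptions, not values from the paper.

import networkx as nx

tasks, machines = ["T0", "T1", "T2"], ["M0", "M1"]
G = nx.DiGraph()

for t in tasks:
    G.add_node(t, demand=-1)                    # each task is a source of one flow unit
G.add_node("S", demand=len(tasks))              # the sink absorbs all task flow
G.add_node("U")                                 # unscheduled aggregator (no net demand)

for t in tasks:
    for m in machines:
        # lower cost = preferred placement (e.g. data locality); one slot per arc
        G.add_edge(t, m, capacity=1, weight=2 if (t, m) == ("T0", "M0") else 10)
    G.add_edge(t, "U", capacity=1, weight=100)  # leaving a task unscheduled is costly

for m in machines:
    G.add_edge(m, "S", capacity=1, weight=0)    # each machine has a single free slot
G.add_edge("U", "S", capacity=len(tasks), weight=0)

flow = nx.min_cost_flow(G)                      # MCMF: optimal placements for this policy
print({t: m for t in tasks for m in machines if flow[t][m] > 0})

With three tasks and two single-slot machines, the min-cost max-flow solution places the two cheapest tasks and routes the third through the unscheduled aggregator.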
Residual Network
The residual network contains, for each arc with spare capacity, a forward arc with the remaining capacity and the original cost, and, for each arc carrying flow, a reverse arc whose capacity equals that flow and whose cost is negated.
[Figure: an example flow graph and its corresponding residual graph]
Scheduling Policy
MCMF Algorithm
Find a flow f that minimizes the total cost (Eq. 1):
    min Σ_{(i,j)∈A} c_ij · f_ij                                      (1)
while respecting the feasibility constraints of mass balance (Eq. 2) and capacity (Eq. 3):
    Σ_{j:(i,j)∈A} f_ij − Σ_{j:(j,i)∈A} f_ji = b(i)   for all i ∈ N   (2)
    0 ≤ f_ij ≤ u_ij                                  for all (i,j) ∈ A   (3)
where b(i) is the supply associated with node i.
Each node i ∈ N also has an associated dual variable π(i), called its potential, which the algorithms adjust to meet the optimality conditions.
The reduced cost of an arc (i, j) with respect to the node potentials is defined by (Eq. 4):
    cπ_ij = c_ij − π(i) + π(j)                                       (4)
Optimality Conditions
A feasible flow is optimal if and only if it satisfies any one of the following three (equivalent) optimality conditions:
Negative cycle optimality
◦ no directed negative-cost cycles exist in the residual network
Reduced cost optimality
◦ there is a set of node potentials π such that no arc in the residual network has negative reduced cost
Complementary slackness optimality
◦ there is a set of node potentials π such that the flow on arcs with cπ_ij > 0 is zero, and there is no arc with both cπ_ij < 0 and available capacity
(A small check of the reduced cost condition is sketched below.)
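As a concrete illustration of the reduced cost optimality condition (not code from the paper), checking a candidate solution could look like this; residual_arcs and potentials are assumed inputs.

def reduced_cost_optimal(residual_arcs, potentials):
    # residual_arcs: iterable of (i, j, cost) tuples for arcs in the residual network;
    # potentials: dict mapping each node to its potential π(i).
    for i, j, cost in residual_arcs:
        reduced_cost = cost - potentials[i] + potentials[j]   # Eq. 4
        if reduced_cost < 0:
            return False     # a residual arc with negative reduced cost: not optimal
    return True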
Algorithms for MCMF
Cycle Canceling
The algorithm first computes a max-flow solution, and then performs
a series of iterations in which it augments flow along negative-cost
directed cycles in the residual network
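A sketch of the subroutine at the heart of cycle canceling: detecting whether the residual network still contains a negative-cost directed cycle, using Bellman-Ford. This is a generic illustration, not Firmament's implementation; recovering the cycle to augment along it is omitted.

def has_negative_cycle(num_nodes, residual_arcs):
    # residual_arcs: list of (u, v, cost) over nodes 0..num_nodes-1.
    # Start all distances at 0 (as if a virtual source reaches every node for free),
    # so a negative cycle anywhere in the network is detected.
    dist = [0.0] * num_nodes
    # Standard Bellman-Ford: n-1 rounds of edge relaxation ...
    for _ in range(num_nodes - 1):
        updated = False
        for u, v, cost in residual_arcs:
            if dist[u] + cost < dist[v]:
                dist[v] = dist[u] + cost
                updated = True
        if not updated:
            break
    # ... and one more pass: any further improvement implies a negative cycle.
    return any(dist[u] + cost < dist[v] for u, v, cost in residual_arcs)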
Successive Shortest Path Algorithm
It repeatedly selects a node with excess supply (i.e., b(i) > 0) and sends flow from
it to the sink along the shortest (cheapest) path in the residual network
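A rough sketch (my own simplification, not the paper's code) of one successive-shortest-path iteration. It assumes the residual network is held as a nested dict residual[u][v] = {"cap": ..., "cost": ...} and that one unit of flow is pushed per iteration, as in the scheduling network where every task supplies a single unit.

import networkx as nx

def ssp_iteration(residual, source, sink):
    # Build a view of the residual arcs that still have capacity left.
    G = nx.DiGraph()
    for u, nbrs in residual.items():
        for v, arc in nbrs.items():
            if arc["cap"] > 0:
                G.add_edge(u, v, weight=arc["cost"])
    # Cheapest source-to-sink path; Bellman-Ford tolerates negative reverse-arc costs.
    path = nx.shortest_path(G, source, sink, weight="weight", method="bellman-ford")
    # Push one unit of flow along the path and update the residual network.
    for u, v in zip(path, path[1:]):
        residual[u][v]["cap"] -= 1
        rev = residual.setdefault(v, {}).setdefault(u, {"cap": 0, "cost": -residual[u][v]["cost"]})
        rev["cap"] += 1
    return path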
Algorithms for MCMF
Cost Scaling
uses a relaxed complementary slackness condition called ε-optimality.
Initially, ε is equal to the maximum arc cost, but ε rapidly
decreases as it is divided by a constant factor after every
iteration that achieves ε-optimality
Cost Scaling
A flow (or pseudoflow) x is said to be ε-optimal for some ε > 0 if, for some node
potentials π, the pair (x, π) satisfies the relaxed complementary slackness (ε-optimality) condition:
    cπ_ij ≥ −ε for every arc (i, j) in the residual network
Lemma: for a minimum-cost flow problem with integer costs, any feasible flow is
ε-optimal whenever ε ≥ C (the maximum arc cost). Moreover, if ε < 1/n, then
any ε-optimal feasible flow is an optimal flow.
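A minimal sketch of the ε-scaling loop that the lemma above justifies; initial_feasible_flow and refine are hypothetical placeholders for the subroutines that compute an initial feasible flow and restore ε-optimality at the current ε.

def cost_scaling(graph, num_nodes, max_arc_cost, alpha=2):
    eps = max_arc_cost                      # any feasible flow is ε-optimal for ε ≥ C
    flow = initial_feasible_flow(graph)     # hypothetical helper (e.g. via a max-flow run)
    potentials = {node: 0 for node in graph}
    while eps >= 1.0 / num_nodes:           # by the lemma, ε < 1/n implies an optimal flow
        flow, potentials = refine(graph, flow, potentials, eps)  # hypothetical subroutine
        eps /= alpha                        # divide ε by a constant factor each iteration
    return flow, potentials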
Algorithms for MCMF
Relaxation
we identify a set of constraints to be relaxed, multiply each such
constraint by a scalar, and subtract the product from the objective
function.
The relaxation algorithm relaxes the mass balance constraints, multiplying the
mass balance constraint of node i by an (unrestricted) variable π(i)
Relaxation
The algorithm maintains a vector of node potentials π and
a pseudoflow f that is an optimal solution for those potentials
It repeatedly performs one of the following two operations
◦ modifies the flow f to f' so that the excess of at least one node
decreases
◦ modifies π to π' and f to f' such that f' is still a reduced-cost-optimal solution and the cost of that solution decreases
Algorithm in Practice
Edge Case Consideration
Optimization (Approximation)
MCMF algorithms return an optimal solution. For the cluster scheduling problem, however, an
approximate solution may not suffice.
Optimization (Incremental Scaling)
Since cluster state does not change dramatically between subsequent
scheduling runs, the MCMF algorithm may be able to reuse its
previous state
Incremental Cost Scaling
A change that flips the reduced cost of an arc (i, j) from cπ_ij < 0 to cπ_ij > 0
breaks optimality, so incremental cost scaling is costly in these cases;
otherwise, it is faster. Since such breaking changes are infrequent, incremental
cost scaling is up to 50% faster than running cost scaling from scratch
Optimization (Problem Specific Heuristics)
Arc Prioritization
prioritize arcs that lead to nodes with demand when extending the
cut, giving them higher priority in the queue so that they are
visited sooner
Efficient Task Removal
based on the insight that removing a running task breaks flow
feasibility, which is expensive for cost scaling to restore
instead, we can reconstruct the task's flow through the graph, remove it, and
drain the machine node's corresponding flow at the single sink node
Firmament Implementation
• The MCMF solver always speculatively executes cost scaling and
relaxation, and picks the solution offered by whichever
algorithm finishes first (sketched below)
• Applies an optimization that helps it efficiently transition
state from relaxation to incremental cost scaling
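A sketch of the "run both, take the first result" strategy in Python; run_cost_scaling and run_relaxation are hypothetical stand-ins for the two solver algorithms (Firmament's actual solver is not Python).

from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def solve_speculatively(flow_network):
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [
            pool.submit(run_cost_scaling, flow_network),  # hypothetical solver entry points
            pool.submit(run_relaxation, flow_network),
        ]
        done, pending = wait(futures, return_when=FIRST_COMPLETED)
        for f in pending:
            f.cancel()  # discard the slower run; a real solver would terminate it
        return next(iter(done)).result()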
Firmament Solver Interaction
Flow network updates
Firmament performs two breadth-first traversals of the flow network to update it for a new solver run (a BFS sketch follows):
◦ the first traversal updates the resource statistics associated with every entity
◦ the second updates the flow network's nodes, arcs, costs and capacities using the statistics gathered in the first
traversal
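A generic sketch of one such breadth-first pass; children and update_stats are hypothetical helpers standing in for Firmament's internal accessors.

from collections import deque

def bfs_pass(root, children, update_stats):
    queue, visited = deque([root]), {root}
    while queue:
        node = queue.popleft()
        update_stats(node)            # e.g. aggregate resource statistics for this entity
        for child in children(node):  # visit each flow network node exactly once
            if child not in visited:
                visited.add(child)
                queue.append(child)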
Task placement extraction
Firmament must extract the task placements implied by the computed flow
We devised a graph traversal algorithm to do so (a simplified sketch follows)
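A simplified sketch of such an extraction (not Firmament's exact algorithm): starting from each task node, follow and consume one unit of flow until a machine node or the unscheduled aggregator U is reached. It assumes the nested flow dict and node names from the earlier networkx sketch.

def extract_placements(flow, tasks, machines):
    placements = {}
    for t in tasks:
        node = t
        while node not in machines and node != "U":
            # pick any outgoing arc that still carries flow and consume one unit of it
            nxt = next(v for v, units in flow[node].items() if units > 0)
            flow[node][nxt] -= 1
            node = nxt
        placements[t] = None if node == "U" else node  # None: task remains unscheduled
    return placements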
Evaluation
• In simulations, we replay a public production workload trace from a 12,500-machine Google cluster
• In local cluster experiments, we use a homogeneous 40-machine cluster. Each
machine has a Xeon E5-2430Lv2 CPU (12× 2.4 GHz), 64 GB RAM, and a 1 TB
magnetic disk for storage
• When we compare with Quincy, we run Firmament with Quincy’s scheduling
policy and restrict the solver to use only cost scaling
Scalability (vs Quincy)
12,500-machine cluster at 90% slot
utilization
Quincy takes between 25 and 60
seconds to place tasks, while Firmament
typically places tasks in hundreds of
milliseconds
Firmament improves task placement
latency by more than 20× over
Quincy
Scalability (vs Quincy)
• Scalability with flow network size
• Quincy originally picked a threshold of
a maximum of ten arcs per task in its
flow network
• To test this, we pick a lower threshold of 14%
for Quincy (a maximum of 7 arcs per task)
• We see runtimes of 20-40 seconds for
Quincy's cost scaling, while Firmament
is much faster
Coping with demanding situations
Performance (vs other Schedulers)
• Local 40-machine cluster
• We simulate a real-world workload mix of short and long jobs
• Firmament uses the network-aware scheduling policy
• Firmament’s 99th percentile response time is 3.4× better than the
SwarmKit and Kubernetes ones, and 6.2× better than Sparrow’s.
Real World Performance
Conclusion
• Firmament can scale to large clusters at low placement latencies
• It chooses the same high-quality placements as an advanced
centralized scheduler (Quincy)
• It achieves a 20× improvement in placement latency over Quincy
Links and References
https://www.usenix.org/conference/osdi16/technical-sessions/presentation/gog
https://www.youtube.com/watch?v=UdtwpgjfR3g