Firmament: Fast, Centralized Cluster Scheduling at Scale
Ionel Gog, University of Cambridge; Malte Schwarzkopf, MIT CSAIL; Adam Gleave and Robert N. M. Watson, University of Cambridge; Steven Hand, Google, Inc.
Proceedings of OSDI '16, the 12th USENIX Symposium on Operating Systems Design and Implementation, pages 99-115. Savannah, GA, USA, November 2-4, 2016. USENIX Association, Berkeley, CA, USA. ©2016. ISBN: 978-1-931971-33-1.

Scheduler
• A cluster scheduler decides how to place tasks on cluster machines, where they are instantiated as processes, containers, or VMs.

Goals
• Low placement latency
• High placement quality

A Good Scheduler
Better task placements by the cluster scheduler lead to:
◦ Higher machine utilization
◦ Shorter batch job runtimes
◦ Improved load balancing
◦ More predictable application performance
◦ Increased fault tolerance

Types of Schedulers
Centralized
◦ High-quality placements, but tends to have high latency
Distributed
◦ Low latency, but tends to produce lower-quality placements
Hybrid
◦ Splits work between a centralized and a distributed component; the distributed component places short tasks

Firmament
Centralized scheduler based on Quincy
◦ Experiments show a 20× improvement in placement latency over Quincy, while maintaining nearly the same placement quality
Flow-based scheduler
◦ Uses a min-cost max-flow (MCMF) algorithm
Key insight
◦ Solve the problem incrementally, using problem-specific optimizations

Goals for Firmament
• Maintain the same high placement quality as an existing centralized scheduler (viz. Quincy)
• Achieve sub-second task placement latency for all workloads in the common case
• Cope well with demanding situations such as cluster oversubscription or large incoming jobs

Flow-Based Scheduler
Task-by-task placement
◦ Commits to a placement early and restricts the choices for tasks that are still waiting
Batch placement
◦ Finds the best trade-off for the whole batch
Flow-based scheduling
◦ Uses min-cost max-flow (MCMF) optimization
◦ MCMF guarantees optimal task placements for a given scheduling policy

Flow Network
• A directed graph whose arcs carry flow from source nodes to a sink node
• All such flow must be drained into the sink node
• A capacity associated with each arc constrains the flow
• A cost associated with each arc expresses preferences among routes
• The network's structure is defined by the scheduling policy, e.g.:
◦ Load spreading
◦ Quincy policy
◦ Network-aware policy

Residual Network
[Figure: a flow graph and its residual graph]

Example: Scheduling Policy
[Figure: example flow network for a scheduling policy]

MCMF Algorithm
• Find a flow f that minimizes the total cost (Eq. 1) while respecting the feasibility constraints: mass balance (Eq. 2) and capacity (Eq. 3)
• b(i) is the supply associated with node i
• Each node i ∈ N has an associated dual variable π(i), called its potential, which is adjusted to meet optimality conditions
• The reduced cost of an arc with respect to the node potentials is defined by Eq. 4
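The equations themselves did not survive in these notes; for reference, this is a sketch of the standard MCMF formulation they refer to, using the usual notation: f(i, j), c(i, j), and u(i, j) are the flow, cost, and capacity of arc (i, j) ∈ A.

minimize   Σ_{(i,j)∈A} c(i, j) · f(i, j)                              (Eq. 1)
subject to Σ_j f(i, j) − Σ_j f(j, i) = b(i)   for all i ∈ N           (Eq. 2, mass balance)
           0 ≤ f(i, j) ≤ u(i, j)              for all (i, j) ∈ A      (Eq. 3, capacity)

and the reduced cost of an arc with respect to the potentials π is

           cπ(i, j) = c(i, j) − π(i) + π(j)                           (Eq. 4)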
Optimality Conditions
A feasible flow is optimal if and only if any one of the following three (equivalent) conditions holds:
Negative cycle optimality
◦ No directed negative-cost cycles exist in the residual network
Reduced cost optimality
◦ There is a set of node potentials π such that no arc in the residual network has negative reduced cost
Complementary slackness optimality
◦ There is a set of node potentials π such that the flow on arcs with cπ(i, j) > 0 is zero, and there are no arcs with both cπ(i, j) < 0 and available capacity

Algorithms for MCMF
Cycle canceling
◦ First computes a max-flow solution, then performs a series of iterations in which it augments flow along negative-cost directed cycles in the residual network
Successive shortest path
◦ Repeatedly selects a source node (i.e., one with b(i) > 0) and sends flow from it to the sink along the shortest path
Cost scaling
◦ Uses a relaxed complementary slackness condition called ε-optimality. Initially, ε is equal to the maximum arc cost, but it decreases rapidly, being divided by a constant factor after every iteration that achieves ε-optimality

Cost Scaling
• A flow or pseudoflow x is ε-optimal for some ε > 0 if, for some node potentials π, the pair (x, π) satisfies the ε-optimality condition cπ(i, j) ≥ −ε for every arc (i, j) in the residual network
• Lemma: for a minimum-cost flow problem with integer costs, any feasible flow is ε-optimal whenever ε ≥ C (the maximum arc cost). Moreover, if ε < 1/n, then any ε-optimal feasible flow is an optimal flow.
[Figures: cost scaling algorithm walkthrough, not captured in these notes]

Algorithms for MCMF
Relaxation
◦ We identify a set of constraints to be relaxed, multiply each such constraint by a scalar, and subtract the product from the objective function
◦ Relaxation relaxes the mass balance constraints, multiplying the mass balance constraint for node i by an (unrestricted) variable π(i)

Relaxation
• The algorithm maintains a vector of node potentials π and a pseudoflow f that is an optimal solution for π
• It repeatedly performs one of the following two operations:
◦ Modifies the flow f to f′ so that the excess of at least one node decreases
◦ Modifies π to π′ and f to f′ such that f′ is still a reduced-cost-optimal solution and the cost of that solution decreases
[Figures: relaxation algorithm walkthrough, not captured in these notes]
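To make the algorithm descriptions above concrete, here is a minimal, illustrative cycle-canceling sketch in Python. It covers only the second phase of the algorithm: it assumes a feasible flow has already been established (e.g., by a max-flow computation) and that costs and capacities are integers. The Arc class and all function names are illustrative, not Firmament's; Firmament's solver uses cost scaling and relaxation instead.

```python
# Cycle canceling for min-cost flow: repeatedly find a negative-cost
# directed cycle in the residual network and augment flow along it.

class Arc:
    def __init__(self, dst, cap, cost):
        self.dst, self.cap, self.cost = dst, cap, cost
        self.rev = None  # paired reverse (residual) arc

def add_arc(graph, u, v, cap, cost):
    """Add arc u->v and its zero-capacity residual counterpart v->u."""
    fwd, bwd = Arc(v, cap, cost), Arc(u, 0, -cost)
    fwd.rev, bwd.rev = bwd, fwd
    graph[u].append(fwd)
    graph[v].append(bwd)

def find_negative_cycle(graph, n):
    """Bellman-Ford with an implicit super-source (dist starts at 0 for
    every node); returns the arcs of a negative-cost residual cycle, or
    None if no such cycle exists."""
    dist, pred = [0] * n, [None] * n
    last = None
    for _ in range(n):
        last = None
        for u in range(n):
            for a in graph[u]:
                if a.cap > 0 and dist[u] + a.cost < dist[a.dst]:
                    dist[a.dst] = dist[u] + a.cost
                    pred[a.dst] = (u, a)
                    last = a.dst
    if last is None:
        return None           # nth pass relaxed nothing: no negative cycle
    for _ in range(n):        # step back n times to land inside the cycle
        last = pred[last][0]
    cycle, v = [], last
    while True:
        u, a = pred[v]
        cycle.append(a)
        v = u
        if v == last:
            return cycle

def cancel_negative_cycles(graph, n):
    """Augment along negative-cost cycles until the flow is optimal."""
    while (cycle := find_negative_cycle(graph, n)) is not None:
        delta = min(a.cap for a in cycle)   # bottleneck residual capacity
        for a in cycle:
            a.cap -= delta
            a.rev.cap += delta
```

Each cancellation strictly reduces the total cost, so the loop terminates for integer inputs; however, it can require many iterations, which is one reason the more sophisticated cost scaling and relaxation algorithms are used in practice.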
Algorithm in Practice
[Figure: solver runtimes in practice, not captured in these notes]

Edge Case Consideration
[Figure not captured in these notes]

Optimization (Approximation)
• MCMF algorithms return an optimal solution. For the cluster scheduling problem, however, an approximate solution may not suffice.

Optimization (Incremental Scaling)
• Cluster state does not change dramatically between subsequent scheduling runs, so the MCMF algorithm may be able to reuse its previous state

Incremental Cost Scaling
• A change that modifies the reduced cost of an arc (i, j) from cπ(i, j) < 0 to cπ(i, j) > 0 breaks optimality; hence, incremental cost scaling is costly in these cases. Otherwise, it is faster.
• Since such breaking changes are infrequent, incremental cost scaling is up to 50% faster than running cost scaling from scratch

Optimization (Problem-Specific Heuristics)
Arc prioritization
◦ Prioritize arcs that lead to nodes with demand when extending the cut, adding them to the front of the priority queue to ensure they are visited sooner
Efficient task removal
◦ Based on the insight that removing a running task breaks feasibility, which is expensive for cost scaling to repair; instead, we can reconstruct the task's flow through the graph, remove it, and drain the machine node's flow at the single sink node
[Figure: effect of the problem-specific heuristics, not captured in these notes]

Firmament Implementation
• The MCMF solver always speculatively executes cost scaling and relaxation, and picks the solution offered by whichever algorithm finishes first
• It applies an optimization that helps it efficiently transition state from relaxation to incremental cost scaling

Firmament Solver Interaction
[Figure not captured in these notes]

Flow network updates
• Firmament does two breadth-first traversals of the flow network to update it for a new solver run:
◦ the first updates the resource statistics associated with every entity;
◦ the second updates the flow network's nodes, arcs, costs, and capacities using the statistics gathered in the first traversal

Task placement extraction
• The solver returns a flow, and Firmament must extract the task placements implied by this flow; we devised a graph traversal algorithm for this (a simplified sketch follows)
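A simplified sketch of such an extraction, under illustrative assumptions: each task node sends exactly one unit of flow toward the sink; flow maps arcs (u, v) to the solver's flow on them; adj[u] lists u's successors; and a task whose flow reaches the sink without passing through a machine node (e.g., via an unscheduled aggregator) remains unplaced. None of these names or data structures are Firmament's actual ones.

```python
def extract_placements(task_nodes, machine_nodes, adj, flow, sink):
    """Trace each task's unit of flow to the machine it was routed to.
    Returns {task: machine or None}. Assumes the flow is acyclic."""
    remaining = dict(flow)   # consumed as we trace, so no two tasks
                             # claim the same unit of flow
    placements = {}
    for task in task_nodes:
        node, placed = task, None
        while node != sink:
            # Follow any outgoing arc that still carries flow; flow
            # conservation guarantees one exists until we hit the sink.
            nxt = next(v for v in adj[node]
                       if remaining.get((node, v), 0) > 0)
            remaining[(node, nxt)] -= 1
            if nxt in machine_nodes:
                placed = nxt
            node = nxt
        placements[task] = placed
    return placements
```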
Evaluation
• In simulations, we replay a public production workload trace from a 12,500-machine Google cluster
• In local cluster experiments, we use a homogeneous 40-machine cluster. Each machine has a Xeon E5-2430L v2 CPU (12× 2.40 GHz), 64 GB of RAM, and a 1 TB magnetic disk for storage
• When we compare with Quincy, we run Firmament with Quincy's scheduling policy and restrict the solver to use only cost scaling

Scalability (vs. Quincy)
• On the 12,500-machine cluster at 90% slot utilization, Quincy takes between 25 and 60 seconds to place tasks, while Firmament typically places tasks in hundreds of milliseconds
• Firmament improves task placement latency by more than 20× over Quincy

Scalability (vs. Quincy): flow network size
• Quincy originally picked a threshold of at most ten preference arcs per task in its flow network
• To test how the flow network's size affects runtime, we pick a lower threshold of 14% for Quincy (at most 7 arcs per task)
• We still see runtimes of 20-40 seconds for Quincy's cost scaling, while Firmament is vastly faster

Coping with Demanding Situations
[Figure not captured in these notes]

Performance (vs. Other Schedulers)
• Local 40-machine cluster; we simulate a real-world workload with short and long jobs
• Firmament uses the network-aware scheduling policy
• Firmament's 99th-percentile task response time is 3.4× better than SwarmKit's and Kubernetes's, and 6.2× better than Sparrow's

Real-World Performance
[Figure not captured in these notes]

Conclusion
• Firmament scales to large clusters at low placement latencies
• It chooses the same high-quality placements as an advanced centralized scheduler
• It is a substantial improvement over Quincy

Links and References
https://www.usenix.org/conference/osdi16/technical-sessions/presentation/gog
https://www.youtube.com/watch?v=UdtwpgjfR3g