Building Scalable and Flexible Cluster Managers Using Declarative Programming

Lalith Suresh, VMware; João Loff, IST (ULisboa) / INESC-ID; Faria Kalim, UIUC; Sangeetha Abdu Jyothi, UC Irvine and VMware; Nina Narodytska, Leonid Ryzhyk, Sahan Gamage, Brian Oki, Pranshu Jain, and Michael Gasch, VMware

https://www.usenix.org/conference/osdi20/presentation/suresh

This paper is included in the Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation, November 4–6, 2020. ISBN 978-1-939133-19-9. Open access to the Proceedings is sponsored by USENIX.

Abstract

Cluster managers like Kubernetes and OpenStack are notoriously hard to develop, given that they routinely grapple with hard combinatorial optimization problems like load balancing, placement, scheduling, and configuration. Today, cluster manager developers tackle these problems by developing system-specific best-effort heuristics, which achieve scalability by significantly sacrificing the cluster manager's decision quality, feature set, and extensibility over time. This is proving untenable, as solutions for cluster management problems are routinely developed from scratch in the industry to solve largely similar problems across different settings.

We propose DCM, a radically different architecture where developers specify the cluster manager's behavior declaratively, using SQL queries over cluster state stored in a relational database. From the SQL specification, the DCM compiler synthesizes a program that, at runtime, can be invoked to compute policy-compliant cluster management decisions given the latest cluster state. Under the covers, the generated program efficiently encodes the cluster state as an optimization problem that can be solved using off-the-shelf solvers, freeing developers from having to design ad-hoc heuristics.

We show that DCM significantly lowers the barrier to building scalable and extensible cluster managers. We validate our claim by powering three production-grade systems with it: a Kubernetes scheduler, a virtual machine management solution, and a distributed transactional datastore.

1 Introduction

Today's data centers are powered by a variety of cluster managers like Kubernetes [10], DRS [47], OpenStack [15], and OpenShift [14]. These systems configure large-scale clusters and allocate resources to jobs. Whether juggling containers, virtual machines, micro-services, virtual network appliances, or serverless functions, these systems must enforce numerous cluster management policies. Some policies represent hard constraints, which must hold in any valid system configuration; e.g., "each container must obtain its minimal requested amount of disk space". Others are soft constraints, which reflect preferences and quality metrics; e.g., "prefer to scatter replicas across as many racks as possible". A cluster manager therefore solves a challenging combinatorial optimization problem of finding configurations that satisfy hard constraints while minimizing violations of soft constraints.

Despite the complexity of the largely similar algorithmic problems involved, cluster managers in various contexts tackle the configuration problem using custom, system-specific best-effort heuristics – an approach that often leads to a software engineering dead-end (§2). As new types of policies are introduced, developers are overwhelmed by having to write code to solve arbitrary combinations of increasingly complex constraints. This is unsurprising given that most cluster management problems involve NP-hard combinatorial optimization that cannot be efficiently solved via naive heuristics. Besides the algorithmic complexity, the lack of separation between the cluster state, the constraints, and the constraint-solving algorithm leads to high code complexity and maintainability challenges, and hinders re-use of cluster manager code across different settings (§2). In practice, even at a large software vendor we find policy-level feature additions to cluster managers take months to develop.

Our contribution. This paper presents Declarative Cluster Managers (DCM), a radically different approach to building cluster managers, wherein the implementation to compute policy-compliant configurations is synthesized by a compiler from a high-level specification.
Specifically, developers using DCM maintain cluster state in a relational database, and declaratively specify the constraints that the cluster manager should enforce using SQL. Given this specification, DCM's compiler synthesizes a program that, at runtime, can be invoked to pull the latest cluster state from the database and compute a set of policy-compliant changes to make to that state (e.g., compute optimal placement decisions for newly launched virtual machines). The generated program – an encoder – encodes the cluster state and constraints into an optimization model that is then solved using a constraint solver.

In doing so, DCM significantly lowers the barrier to building cluster managers that achieve all three of scalability, high decision quality, and extensibility to add new features and policies. In contrast, today's cluster managers use custom heuristics that heavily sacrifice both decision quality and extensibility to meet scalability goals (§2).

For scalability, our compiler generates implementations that construct highly efficient constraint solver encodings that scale to problem sizes in large clusters (e.g., 53% improved p99 placement latency in a 500 node cluster over the heavily optimized Kubernetes scheduler, §6.1).
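To make the workflow concrete, the following is a minimal sketch of what a DCM-style specification could look like. The table and view names, the controllable__node_name column, and the convention that a constraint view must hold for every row it covers are illustrative assumptions for this example, not a reproduction of DCM's exact API.

```sql
-- Illustrative sketch only. The schema, the 'controllable__' naming
-- convention, and the 'constraint_' view prefix are assumptions made for
-- this example; they are not necessarily DCM's exact interface.

-- Schema.sql: cluster state mirrored into a relational database.
CREATE TABLE node (
    name          VARCHAR(100) NOT NULL PRIMARY KEY,
    cpu_capacity  BIGINT NOT NULL,
    mem_capacity  BIGINT NOT NULL
);

CREATE TABLE pod (
    name                     VARCHAR(100) NOT NULL PRIMARY KEY,
    status                   VARCHAR(36)  NOT NULL,   -- e.g., 'Pending', 'Running'
    cpu_request              BIGINT NOT NULL,
    mem_request              BIGINT NOT NULL,
    controllable__node_name  VARCHAR(100)             -- assignment the solver decides
);

-- Constraints.sql: a hard constraint over the cluster state. Here we assume
-- the compiler treats the WHERE clause as a condition that every pending
-- pod's assignment must satisfy: pods may only go to nodes that exist.
CREATE VIEW constraint_assign_to_existing_nodes AS
SELECT * FROM pod
WHERE status = 'Pending'
  AND controllable__node_name IN (SELECT name FROM node);
```

Under these assumptions, the generated encoder would read the tables at runtime, treat controllable__node_name as a decision variable, and hand the resulting optimization model to the solver.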
For high decision quality, the use of constraint solvers under-the-covers guarantees optimal solutions for the specified problems, with the freedom to relax the solution quality if needed (e.g., 4× better load balancing in a commercial virtual machine resource manager, §6.2).

For extensibility, DCM enforces a strict separation between a) cluster state, b) the modular and concise representation of constraints in SQL, and c) the solving logic. This makes it easy to add new constraints and non-trivial features (e.g., making a Kubernetes scheduler place both pods and virtual machines in a custom Kubernetes distribution, §6.3).

Figure 1: DCM architecture. Dotted lines show the compilation flow. Solid lines show runtime interactions between the DCM runtime, user code, and the cluster state DB. The DCM compiler takes Schema.sql and Constraints.sql and generates encoder code; at runtime, (1) user code invokes the generated code via the runtime, (2) the encoder fetches the required model input data from the database, (3) the encoder generates the optimization model and invokes the solver, (4) the solver returns a solution, and (5) the new configuration is returned to user code.

Several research systems [46,53,57,78,92] propose to use constraint solvers for cluster management tasks. These systems all involve a significant amount of effort from optimization experts to handcraft an encoder for specific problems with simple, well-defined constraints – let alone encode the full complexity and feature sets of production-grade cluster managers (e.g., Kubernetes has 30 policies for driving initial placement alone). Even simple encoders are challenging to scale to large problem sizes and are not extensible even when they do scale (§8). In fact, for these reasons, constraint solvers remain rarely used within production-grade cluster managers in the industry-at-large: none of the open-source cluster managers use solvers and, anecdotally, nor do widely used enterprise offerings in this space.

Instead, with DCM, developers need not handcraft heuristics nor solver encodings to tackle challenging cluster management problems.

Providing a capability like DCM is fraught with challenges. First, cluster managers operate in a variety of modes and timescales: from incrementally placing new workloads at millisecond timescales, to periodically performing global reconfiguration (like load balancing or descheduling); we design a programming model that is flexible enough to accommodate these various use cases within a single system (§3). Second, constraint solvers are not a panacea and are notoriously hard to scale to large problem sizes. DCM's compiler uses carefully designed optimization passes that bridge the wide chasm between a high-level SQL specification of a cluster management problem and an efficient, low-level representation of an optimization model – doing so leverages the strengths of the constraint solver while avoiding its weaknesses (§4).
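As one illustration of how a single programming model might cover both timescales, the sketch below reuses the hypothetical pod table from the earlier example; the view names, and the idea that the solver only treats the selected rows' assignments as decision variables, are assumptions for this sketch rather than DCM's documented behavior.

```sql
-- Illustrative sketch, reusing the hypothetical 'pod' table from the earlier
-- example. We assume the solver treats only the rows returned by these views
-- as having free 'controllable__node_name' variables.

-- Incremental placement at millisecond timescales: only newly arrived
-- (pending) pods are decision variables; running pods stay put.
CREATE VIEW pods_to_assign AS
SELECT * FROM pod WHERE status = 'Pending';

-- Periodic global reconfiguration (e.g., load balancing or descheduling):
-- running pods also become eligible for reassignment.
CREATE VIEW pods_to_reassign AS
SELECT * FROM pod WHERE status IN ('Pending', 'Running');
```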
Summary of results. We report in-depth about our experience building and extending a Kubernetes Scheduler using DCM. We implement existing policies in Kubernetes in under 20 lines of SQL each. On a 500 node Kubernetes cluster on AWS, DCM improves 95th percentile pod placement latencies by up to 2×, is 10× more likely to find feasible placements in constrained scenarios, and correctly preempts pods 2× faster than the baseline scheduler. We also report simulation results with up to 10K node clusters and experiment with non-trivial extensions to the scheduler, like placing both pods and VMs within a custom Kubernetes distribution. We also use DCM to power a commercial virtual machine management solution where we improved load balancing quality by 4×. Lastly, we briefly discuss a distributed transactional datastore where we implemented several features with a few lines of SQL.

2 Motivation

Our motivating concern is that ad-hoc solutions for cluster management problems are regularly built from scratch in the industry, due to the wide range of specialized data-center environments and workloads that organizations have, for which off-the-shelf solutions do not suffice. Even beyond dedicated cluster managers like Kubernetes [10], OpenStack [15], and Nomad [50], similar capabilities are routinely embedded within enterprise-grade distributed systems like databases and storage systems: e.g., for policy-based configuration, data replication, or load-balancing across machines, all of which are combinatorial optimization problems.

Today, developers handcraft heuristics to solve these cluster management problems, which incurs significant engineering overhead. First, the heuristics are hard to scale to clusters with hundreds to thousands of nodes; they often require purpose-built and inflexible pre-computing and caching optimizations to remain tractable [40,95]. Even then, the heuristics are challenging to get right as developers have to account for arbitrary combinations of constraints. Second, the heuristics sacrifice decision quality to scale (e.g., load balancing quality), which is not surprising given that combinatorial optimization problems cannot be solved efficiently via naive heuristics. Third, they lead to complex code that makes it hard to extend and evolve the cluster manager over time; it is not uncommon for policy-level feature additions to take multiple months' worth of effort to deliver.

We illustrate the above challenges using Kubernetes as a representative example.

Kubernetes example. The Kubernetes Scheduler is responsible for assigning groups of containers, called pods, to cluster nodes (physical or virtual machines). Each pod has a number of user-supplied attributes describing its resource demand (CPU, memory, storage, and custom resources) and placement preferences (the pod's affinity or anti-affinity to other pods or nodes). These attributes represent hard constraints that must be satisfied for the pod to be placed on a node (H1–H20 in Table 2). Kubernetes also supports soft versions of placement constraints, with a violation cost assigned to each constraint (S1–S9 in Table 2). Like other task-by-task schedulers [15,94,95], the Kubernetes default scheduler uses a greedy, best-effort heuristic to place one task (pod) at a time, drawn from a queue. For each pod, the scheduler tries to find feasible nodes according to the hard constraints, score them according to the soft constraints, and pick the best-scored node. Feasibility and scoring are parallelized for speed.

Figure 2: Policies from the baseline Kubernetes scheduler, showing both hard (H) constraints and soft (S) constraints.
  H1-4    Avoid nodes with resource overload, unavailability, or errors
  H5      Resource capacity constraints: pods scheduled on a node must not exceed the node's CPU, memory, and disk capacity
  H6      Ensure network ports on the host machine requested by the pod are available
  H7      Respect requests by a pod for specific nodes
  H8      If a pod is already assigned to a node, do not reassign it
  H9      Ensure a pod is in the same zone as its requested volumes
  H10-11  If a node has a 'taint' label, ensure pods on that node are configured to tolerate those taints
  H12-13  Node affinity/anti-affinity: pods are affine/anti-affine to nodes according to configured labels
  H14     Inter-pod affinity/anti-affinity: pods are affine/anti-affine to each other according to configured labels
  H15     Pods of the same service must be in the same failure-domain
  H16-20  Volume constraints specific to GCE, AWS, Azure
  S1      Spread pods from the same group across nodes
  S2-5    Load balance pods according to CPU/memory load on nodes
  S6      Prefer nodes that have matching labels
  S7      Inter-pod affinity/anti-affinity by labels
  S8      Prefer to not exceed node resource limits
  S9      Prefer nodes where container images are already available
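To give a flavor of how a policy from Table 2 could be written declaratively, the sketch below phrases a capacity constraint in the spirit of H5 over the hypothetical schema introduced earlier; it is an assumed encoding, not the Kubernetes scheduler's or DCM's actual implementation.

```sql
-- Illustrative sketch of an H5-style capacity policy over the hypothetical
-- 'pod' and 'node' tables from the earlier example. We assume the compiler
-- requires the HAVING predicate to hold for every node in the result.
CREATE VIEW constraint_node_capacity AS
SELECT node.name
FROM node
JOIN pod ON pod.controllable__node_name = node.name
GROUP BY node.name, node.cpu_capacity, node.mem_capacity
HAVING SUM(pod.cpu_request) <= node.cpu_capacity
   AND SUM(pod.mem_request) <= node.mem_capacity;
```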
Decision quality: not guaranteed to find feasible, let alone optimal, placements. Pod scheduling is a variant of the multidimensional bin packing problem [18,21], which is NP-hard and cannot, in the general case, be solved efficiently with greedy algorithms. This is especially the case when the scheduling problem is tight due to workload consolidation and users increasingly relying on affinity constraints for performance and availability.

To remain performant, the Kubernetes scheduler only considers a random subset of nodes when scheduling a pod, which might miss feasible nodes [93]. Furthermore, the scheduler may commit to placing a pod and deny feasible choices from pods that are already pending in the scheduler's queue (a common weakness among task-by-task schedulers [40]).

Feature limitations: best-effort scheduling does not support global reconfiguration. Many scenarios require the scheduler to simultaneously reconfigure arbitrary groups of pods and nodes. For instance, Figure 3 shows a scenario where a high priority pod (pod 1) can only be placed on node 1, but to do so, the scheduler has to preempt a lower priority pod on node 2. Computing this rearrangement requires simultaneous reasoning about resource and affinity constraints spanning multiple pods and nodes, which cannot be achieved in the current architecture. Thus, although such global reconfiguration is in high demand among users, it is unsupported in Kubernetes [60,64].

Figure 3: An anti-affinity constraint prevents Pod 1 and Pod 2 from being in the same zone, pod 1 is affine to node 1, and pod 2 has a lower priority than pod 1. Placing pod 1 on node 1 requires evicting pod 2 on node 2. (Without cross-node preemption, Pod 1 cannot be placed; with cross-node preemption, the lower-priority pod is preempted.)

Extensibility: Best-effort scheduling leads to complex code. Similar to Borg [40,95], Kubernetes needs careful engineering to keep scheduling tractable at scale. Several policies like inter-pod affinity (Table 2-H14) and service affinities (Table 2-H15) are compute intensive because they require reasoning over groups of pods. These policies are kept tractable using carefully designed caching and pre-computing optimizations that are fragile in the face of evolving requirements. For example, it is hard to extend inter-pod affinity policies to specify the number of pods per node [58,59,61–63], and there are discussions among developers to restrict these policies to make the code efficient [60]. For similar reasons, there are discussions among developers to remove the service affinity policy due to accumulating technical debt around its pre-computing optimizations [69]. Such complexity accumulates to make entire classes of policies requested by users difficult to implement in the scheduler [60,64,73].
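For contrast, the following is a hypothetical sketch of how such a group-level cardinality policy (limiting how many pods of a group may land on one node) might be phrased declaratively; the pod_group column and the limit of 2 are invented for this illustration and are not part of Kubernetes or of the schema shown earlier.

```sql
-- Hypothetical sketch: cap the number of pods of the same group per node.
-- 'pod_group' is an invented column (not in the earlier schema), and the
-- limit of 2 is arbitrary; we again assume the compiler requires the HAVING
-- predicate to hold for every (group, node) pair.
CREATE VIEW constraint_pods_per_node_limit AS
SELECT pod_group, controllable__node_name
FROM pod
GROUP BY pod_group, controllable__node_name
HAVING COUNT(*) <= 2;
```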
Beyond policy-level extensions, the tight coupling between the cluster state representation in the scheduler's data-structures and the scheduling logic makes it near impossible to introduce changes to the underlying abstractions (e.g., extending the scheduler to also place tasks other than pods, like virtual machines [71]) without a complete rewrite [66].

3 Declarative Programming with DCM

Our position is that developers should specify cluster management policies using a high-level declarative language, and let an automated tool like DCM generate the logic that efficiently computes policy-compliant decisions. Architecturally,