Building Scalable and Flexible Cluster Managers Using Declarative Programming

Lalith Suresh, VMware; João Loff, IST (ULisboa) / INESC-ID; Faria Kalim, UIUC; Sangeetha Abdu Jyothi, UC Irvine and VMware; Nina Narodytska, Leonid Ryzhyk, Sahan Gamage, Brian Oki, Pranshu Jain, and Michael Gasch, VMware

https://www.usenix.org/conference/osdi20/presentation/suresh

This paper is included in the Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation, November 4-6, 2020. ISBN 978-1-939133-19-9. Open access to the Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation is sponsored by USENIX.
Abstract

Cluster managers like Kubernetes and OpenStack are notoriously hard to develop, given that they routinely grapple with hard combinatorial optimization problems like load balancing, placement, scheduling, and configuration. Today, cluster manager developers tackle these problems by developing system-specific best-effort heuristics, which achieve scalability by significantly sacrificing the cluster manager's decision quality, feature set, and extensibility over time. This is proving untenable, as solutions for cluster management problems are routinely developed from scratch in the industry to solve largely similar problems across different settings.

We propose DCM, a radically different architecture where developers specify the cluster manager's behavior declaratively, using SQL queries over cluster state stored in a relational database. From the SQL specification, the DCM compiler synthesizes a program that, at runtime, can be invoked to compute policy-compliant cluster management decisions given the latest cluster state. Under the covers, the generated program efficiently encodes the cluster state as an optimization problem that can be solved using off-the-shelf solvers, freeing developers from having to design ad-hoc heuristics.

We show that DCM significantly lowers the barrier to building scalable and extensible cluster managers. We validate our claim by powering three production-grade systems with it: a Kubernetes scheduler, a virtual machine management solution, and a distributed transactional datastore.

1   Introduction

Today's data centers are powered by a variety of cluster managers like Kubernetes [10], DRS [47], OpenStack [15], and OpenShift [14]. These systems configure large-scale clusters and allocate resources to jobs. Whether juggling containers, virtual machines, micro-services, virtual network appliances, or serverless functions, these systems must enforce numerous cluster management policies. Some policies represent hard constraints, which must hold in any valid system configuration; e.g., "each container must obtain its minimal requested amount of disk space". Others are soft constraints, which reflect preferences and quality metrics; e.g., "prefer to scatter replicas across as many racks as possible". A cluster manager therefore solves a challenging combinatorial optimization problem of finding configurations that satisfy hard constraints while minimizing violations of soft constraints.

Despite the complexity of the largely similar algorithmic problems involved, cluster managers in various contexts tackle the configuration problem using custom, system-specific best-effort heuristics – an approach that often leads to a software engineering dead-end (§2). As new types of policies are introduced, developers are overwhelmed by having to write code to solve arbitrary combinations of increasingly complex constraints. This is unsurprising given that most cluster management problems involve NP-hard combinatorial optimization that cannot be efficiently solved via naive heuristics. Besides the algorithmic complexity, the lack of separation between the cluster state, the constraints, and the constraint-solving algorithm leads to high code complexity and maintainability challenges, and hinders re-use of cluster manager code across different settings (§2). In practice, even at a large software vendor we find that policy-level feature additions to cluster managers take months to develop.

Our contribution   This paper presents Declarative Cluster Managers (DCM), a radically different approach to building cluster managers, wherein the implementation to compute policy-compliant configurations is synthesized by a compiler from a high-level specification.

Specifically, developers using DCM maintain cluster state in a relational database, and declaratively specify the constraints that the cluster manager should enforce using SQL. Given this specification, DCM's compiler synthesizes a program that, at runtime, can be invoked to pull the latest cluster state from the database and compute a set of policy-compliant changes to make to that state (e.g., compute optimal placement decisions for newly launched virtual machines). The generated program – an encoder – encodes the cluster state and constraints into an optimization model that is then solved using a constraint solver.
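For illustration, a minimal sketch of what such a specification could look like is shown below, in the spirit of DCM's SQL-based approach. The schema, the controllable__node_name variable column, and the check clause are illustrative assumptions of this sketch, not the paper's exact dialect:

    -- Cluster state lives in ordinary SQL tables (illustrative schema).
    CREATE TABLE node (
        name         VARCHAR(100) NOT NULL PRIMARY KEY,
        cpu_capacity BIGINT NOT NULL
    );

    CREATE TABLE pod (
        name        VARCHAR(100) NOT NULL PRIMARY KEY,
        cpu_request BIGINT NOT NULL,
        -- A "variable column": the compiler-generated encoder turns it
        -- into solver decision variables; the solver assigns its value.
        controllable__node_name VARCHAR(100)
    );

    -- A hard constraint, written as a query over the cluster state:
    -- every pod must be assigned to a node that actually exists.
    CREATE VIEW constraint_assign_to_valid_nodes AS
    SELECT * FROM pod
    CHECK controllable__node_name IN (SELECT name FROM node);

At runtime, the generated encoder reads these tables, emits an optimization model in which each controllable__node_name becomes a decision variable, and writes the solver's assignment back as the new configuration.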
In doing so, DCM significantly lowers the barrier to building cluster managers that achieve all three of scalability, high decision quality, and extensibility to add new features and policies. In contrast, today's cluster managers use custom heuristics that heavily sacrifice both decision quality and extensibility to meet scalability goals (§2).

For scalability, our compiler generates implementations that construct highly efficient constraint solver encodings that scale to problem sizes in large clusters (e.g., 53% improved p99 placement latency in a 500-node cluster over the heavily optimized Kubernetes scheduler, §6.1).
For high decision quality, the use of constraint solvers under the covers guarantees optimal solutions for the specified problems, with the freedom to relax the solution quality if needed (e.g., 4× better load balancing in a commercial virtual machine resource manager, §6.2).

For extensibility, DCM enforces a strict separation between a) the cluster state, b) the modular and concise representation of constraints in SQL, and c) the solving logic. This makes it easy to add new constraints and non-trivial features (e.g., making a Kubernetes scheduler place both pods and virtual machines in a custom Kubernetes distribution, §6.3).

Several research systems [46,53,57,78,92] propose to use constraint solvers for cluster management tasks. These systems all involve a significant amount of effort from optimization experts to handcraft an encoder for specific problems with simple, well-defined constraints – let alone encode the full complexity and feature sets of production-grade cluster managers (e.g., Kubernetes has 30 policies for driving initial placement alone). Even simple encoders are challenging to scale to large problem sizes and are not extensible even when they do scale (§8). In fact, for these reasons, constraint solvers remain rarely used within production-grade cluster managers in the industry at large: none of the open-source cluster managers use solvers and, anecdotally, nor do widely used enterprise offerings in this space.

Instead, with DCM, developers need not handcraft heuristics nor solver encodings to tackle challenging cluster management problems.

Providing a capability like DCM is fraught with challenges. First, cluster managers operate in a variety of modes and timescales: from incrementally placing new workloads at millisecond timescales, to periodically performing global reconfiguration (like load balancing or descheduling); we design a programming model that is flexible enough to accommodate these various use cases within a single system (§3). Second, constraint solvers are not a panacea and are notoriously hard to scale to large problem sizes. DCM's compiler uses carefully designed optimization passes that bridge the wide chasm between a high-level SQL specification of a cluster management problem and an efficient, low-level representation of an optimization model – doing so leverages the strengths of the constraint solver while avoiding its weaknesses (§4).

Summary of results   We report in depth about our experience building and extending a Kubernetes scheduler using DCM. We implement existing policies in Kubernetes in under 20 lines of SQL each. On a 500-node Kubernetes cluster on AWS, DCM improves 95th percentile pod placement latencies by up to 2×, is 10× more likely to find feasible placements in constrained scenarios, and correctly preempts pods 2× faster than the baseline scheduler. We also report simulation results with up to 10K-node clusters and experiment with non-trivial extensions to the scheduler, like placing both pods and VMs within a custom Kubernetes distribution. We also use DCM to power a commercial virtual machine management solution where we improved load balancing quality by 4×. Lastly, we briefly discuss a distributed transactional datastore where we implemented several features with a few lines of SQL.

[Figure 1: DCM architecture. Dotted lines show the compilation flow. Solid lines show runtime interactions between the DCM runtime, user code, and the cluster state DB. Compilation: the DCM compiler generates an encoder from Schema.sql and Constraints.sql. Runtime: (1) user code invokes the generated code via the runtime; (2) the encoder fetches required input data from the database; (3) the encoder generates the optimization model and invokes the solver; (4) the solver returns a solution; (5) a new configuration is returned to user code.]

2   Motivation

Our motivating concern is that ad-hoc solutions for cluster management problems are regularly built from scratch in the industry, due to the wide range of specialized data-center environments and workloads that organizations have, for which off-the-shelf solutions do not suffice. Even beyond dedicated cluster managers like Kubernetes [10], OpenStack [15], and Nomad [50], similar capabilities are routinely embedded within enterprise-grade distributed systems like databases and storage systems: e.g., for policy-based configuration, data replication, or load balancing across machines, all of which are combinatorial optimization problems.

Today, developers handcraft heuristics to solve these cluster management problems, which incurs significant engineering overhead. First, the heuristics are hard to scale to clusters with hundreds to thousands of nodes; they often require purpose-built and inflexible pre-computing and caching optimizations to remain tractable [40,95]. Even then, the heuristics are challenging to get right, as developers have to account for arbitrary combinations of constraints. Second, the heuristics sacrifice decision quality to scale (e.g., load balancing quality), which is not surprising given that combinatorial optimization problems cannot be solved efficiently via naive heuristics. Third, they lead to complex code that makes it hard to extend and evolve the cluster manager over time; it is not uncommon for policy-level feature additions to take multiple months' worth of effort to deliver.

We illustrate the above challenges using Kubernetes as a representative example.

Kubernetes example   The Kubernetes Scheduler is responsible for assigning groups of containers, called pods, to cluster nodes (physical or virtual machines).
Each pod has a number of user-supplied attributes describing its resource demand (CPU, memory, storage, and custom resources) and placement preferences (the pod's affinity or anti-affinity to other pods or nodes). These attributes represent hard constraints that must be satisfied for the pod to be placed on a node (H1-H20 in Figure 2). Kubernetes also supports soft versions of placement constraints, with a violation cost assigned to each constraint (S1-S9 in Figure 2). Like other task-by-task schedulers [15,94,95], the Kubernetes default scheduler uses a greedy, best-effort heuristic to place one task (pod) at a time, drawn from a queue. For each pod, the scheduler tries to find feasible nodes according to the hard constraints, score them according to the soft constraints, and pick the best-scored node. Feasibility and scoring are parallelized for speed.

Policy    Description
H1-4      Avoid nodes with resource overload, unavailability or errors
H5        Resource capacity constraints: pods scheduled on a node must not exceed the node's CPU, memory, and disk capacity
H6        Ensure network ports on host machine requested by pod are available
H7        Respect requests by a pod for specific nodes
H8        If pod is already assigned to a node, do not reassign
H9        Ensure pod is in the same zone as its requested volumes
H10-11    If a node has a 'taint' label, ensure pods on that node are configured to tolerate those taints
H12-13    Node affinity/anti-affinity: pods are affine/anti-affine to nodes according to configured labels
H14       Inter-pod affinity/anti-affinity: pods are affine/anti-affine to each other according to configured labels
H15       Pods of the same service must be in the same failure domain
H16-20    Volume constraints specific to GCE, AWS, Azure
S1        Spread pods from the same group across nodes
S2-5      Load balance pods according to CPU/memory load on nodes
S6        Prefer nodes that have matching labels
S7        Inter-pod affinity/anti-affinity by labels
S8        Prefer to not exceed node resource limits
S9        Prefer nodes where container images are already available

Figure 2: Policies from the baseline Kubernetes scheduler, showing both hard (H) constraints and soft (S) constraints.
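To connect this table to the declarative approach of §3, the sketch below shows how a policy like H5 might read as a query over cluster state. It reuses the illustrative dialect from §1 (table and column names and the check clause are assumptions of this sketch, not the scheduler's or DCM's exact code):

    -- Sketch of H5: the CPU requests of pods assigned to a node must
    -- not exceed that node's capacity (memory and disk are analogous).
    CREATE VIEW constraint_cpu_capacity AS
    SELECT * FROM node
    CHECK cpu_capacity >= (SELECT SUM(pod.cpu_request)
                             FROM pod
                            WHERE pod.controllable__node_name = node.name);

A soft policy like S1 or S2-5 would be phrased over the same kind of query, but with an objective whose violation cost the solver minimizes rather than a hard check.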
Decision quality: not guaranteed to find feasible, let alone optimal, placements   Pod scheduling is a variant of the multidimensional bin packing problem [18,21], which is NP-hard and cannot, in the general case, be solved efficiently with greedy algorithms. This is especially the case when the scheduling problem is tight due to workload consolidation and users increasingly relying on affinity constraints for performance and availability.
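To make "multidimensional bin packing" concrete, the core placement problem can be written as an integer program; the notation below is ours, stated for intuition, and is not a formulation from the paper. Let x_{p,n} = 1 if pod p is placed on node n, d_r(p) be pod p's demand for resource r, C_r(n) be node n's capacity, and c_{p,n} be the soft-constraint violation cost of the assignment:

    \min \sum_{p,n} c_{p,n}\, x_{p,n}
    \text{s.t.} \quad \sum_{n} x_{p,n} = 1 \qquad \forall p
    \qquad\quad \sum_{p} d_r(p)\, x_{p,n} \le C_r(n) \qquad \forall n,\ \forall r \in \{\text{cpu}, \text{mem}, \text{disk}, \ldots\}
    \qquad\quad x_{p,n} \in \{0, 1\}

Even with a single resource dimension, deciding feasibility already encodes classical bin packing and is NP-hard; a greedy, one-pod-at-a-time heuristic therefore cannot guarantee feasible, let alone optimal, assignments.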
To remain performant, the Kubernetes scheduler only considers a random subset of nodes when scheduling a pod, which might miss feasible nodes [93]. Furthermore, the scheduler may commit to placing a pod and deny feasible choices from pods that are already pending in the scheduler's queue (a common weakness among task-by-task schedulers [40]).

Feature limitations: best-effort scheduling does not support global reconfiguration   Many scenarios require the scheduler to simultaneously reconfigure arbitrary groups of pods and nodes. For instance, Figure 3 shows a scenario where a high-priority pod (pod 1) can only be placed on node 1, but to do so, the scheduler has to preempt a lower-priority pod on node 2. Computing this rearrangement requires simultaneous reasoning about resource and affinity constraints spanning multiple pods and nodes, which cannot be achieved in the current architecture. Thus, although such global reconfiguration is in high demand among users, it is unsupported in Kubernetes [60,64].

[Figure 3: An anti-affinity constraint prevents Pod 1 and Pod 2 from being in the same zone, pod 1 is affine to node 1, and pod 2 has a lower priority than pod 1. Placing pod 1 on node 1 requires evicting pod 2 on node 2. Left panel: without cross-node preemption, pod 1 cannot be placed. Right panel: with cross-node preemption, the lower-priority pod is preempted.]
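In a declarative formulation, the joint reasoning that Figure 3 calls for is just another constraint over pairs of pods. The sketch below uses the same illustrative dialect as before (the node_zone table, the group_name column, and the check clause are assumptions of this sketch):

    -- Sketch: pods in the same anti-affinity group must land in
    -- different zones (as with pods 1 and 2 in Figure 3).
    CREATE VIEW constraint_zone_anti_affinity AS
    SELECT * FROM pod p1 JOIN pod p2
        ON p1.group_name = p2.group_name AND p1.name < p2.name
    CHECK (SELECT zone FROM node_zone
            WHERE node_name = p1.controllable__node_name)
       <> (SELECT zone FROM node_zone
            WHERE node_name = p2.controllable__node_name);

Because a solver sees both pods' assignments and this constraint at once, it can discover the cross-node preemption of Figure 3, which a one-pod-at-a-time heuristic structurally cannot.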
Extensibility: best-effort scheduling leads to complex code   Similar to Borg [40,95], Kubernetes needs careful engineering to keep scheduling tractable at scale. Several policies, like inter-pod affinity (Figure 2, H14) and service affinity (Figure 2, H15), are compute-intensive because they require reasoning over groups of pods. These policies are kept tractable using carefully designed caching and pre-computing optimizations that are fragile in the face of evolving requirements. For example, it is hard to extend inter-pod affinity policies to specify the number of pods per node [58,59,61-63], and there are discussions among developers to restrict these policies to make the code efficient [60]. For similar reasons, there are discussions among developers to remove the service affinity policy due to accumulating technical debt around its pre-computing optimizations [69]. Such complexity accumulates to make entire classes of policies requested by users difficult to implement in the scheduler [60,64,73].

Beyond policy-level extensions, the tight coupling between the cluster state representation in the scheduler's data structures and the scheduling logic makes it near impossible to introduce changes to the underlying abstractions (e.g., extending the scheduler to also place tasks other than pods, like virtual machines [71]) without a complete rewrite [66].

3   Declarative Programming with DCM

Our position is that developers should specify cluster management policies using a high-level declarative language, and let an automated tool like DCM generate the logic that efficiently computes policy-compliant decisions. Architecturally,