Saturday, September 24, 2022

Scaling Kubernetes to 7,500 Nodes


We've scaled Kubernetes clusters to 7,500 nodes, producing a scalable infrastructure for large models like GPT-3, CLIP, and DALL·E, but also for rapid small-scale iterative research such as Scaling Laws for Neural Language Models. Scaling a single Kubernetes cluster to this size is rarely done and requires some special care, but the upside is a simple infrastructure that allows our machine learning research teams to move faster and scale up without changing their code.
Since our last post on Scaling to 2,500 Nodes we've continued to grow our infrastructure to meet researcher needs, in the process learning many additional lessons. This post summarizes those lessons so that others in the Kubernetes community can benefit from them, and ends with problems we still face that we'll be tackling next.

Our Workload
Before we get too far, it's important to describe our workload. The applications and hardware we run with Kubernetes are quite different from what you may encounter at a typical company. Our problems and corresponding solutions may, or may not, be a good match for your own setup!
A large machine learning job spans many nodes and runs most efficiently when it has access to all of the hardware resources on each node. This allows GPUs to cross-communicate directly using NVLink, or GPUs to communicate directly with the NIC using GPUDirect. So for many of our workloads, a single pod occupies the entire node. Any NUMA, CPU, or PCIE resource contention aren't factors for scheduling. Bin-packing or fragmentation is not a common problem. Our current clusters have full bisection bandwidth, so we also don't make any rack or network topology considerations. All of this means that, while we have many nodes, there's relatively low strain on the scheduler.
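A whole-node pod of this shape can be sketched roughly as follows; the image name, pod name, and GPU count are illustrative assumptions, not our actual manifests:

```yaml
# Hypothetical whole-node training pod: claiming every GPU on the node
# ensures the scheduler places nothing else alongside it.
apiVersion: v1
kind: Pod
metadata:
  name: training-worker-0
spec:
  containers:
  - name: worker
    image: example.com/training:latest
    resources:
      limits:
        nvidia.com/gpu: 8   # all GPUs on the node, so the pod owns the host
```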
That said, strain on the kube-scheduler is spiky. A new job may consist of many hundreds of pods all being created at once, then return to a relatively low rate of churn.

Our biggest jobs run MPI, and all pods within the job are participating in a single MPI communicator. If any of the participating pods die, the entire job halts and needs to be restarted. The job checkpoints regularly, and when restarted it resumes from the last checkpoint. Thus we consider the pods to be semi-stateful—killed pods can be replaced and work can continue, but doing so is disruptive and should be kept to a minimum.
We don't rely on Kubernetes load balancing all that much. We have very little HTTPS traffic, with no need for A/B testing, blue/green, or canaries. Pods communicate directly with one another on their pod IP addresses with MPI via SSH, not service endpoints. Service "discovery" is limited; we just do a one-time lookup for which pods are participating in MPI at job startup time.
Most jobs interact with some form of blob storage. They usually either stream some shards of a dataset or checkpoint directly from blob storage, or cache it to a fast local ephemeral disk. We have a few PersistentVolumes for cases where POSIX semantics are useful, but blob storage is far more scalable and doesn't require slow detach/attach operations.
Lastly, the nature of our work is fundamentally research, which means the workloads themselves are ever-changing. While the Supercomputing team strives to provide what we'd consider a "production" quality level of compute infrastructure, the applications that run on that cluster are short-lived and their developers iterate quickly. New usage patterns may emerge at any time that challenge our assumptions about trends and appropriate tradeoffs. We need a sustainable system that also allows us to respond quickly when things change.
Networking
As the number of nodes and pods within our clusters increased, we found that Flannel had difficulties scaling up the throughput required. We switched to using the native pod networking technologies for our IP configurations for Azure VMSSes and the associated CNI plugins. This allowed us to get host-level network throughput on our pods.
Another reason we've switched to using alias-based IP addressing is that on our largest clusters, we could possibly have roughly 200,000 IP addresses in use at any one time. When we tested route-based pod networking, we found there were significant limitations in the number of routes we could effectively use.
Avoiding encapsulation increases the demands on the underlying SDN or routing engine, but it keeps our networking setup simple. Adding VPN or tunneling can be done without any additional adapters. We don't need to worry about packet fragmentation due to some portion of the network having a lower MTU. Network policies and traffic monitoring are straightforward; there's no ambiguity about the source and destination of packets.
We use iptables tagging on the host to track network resource usage per Namespace and pod. This lets researchers visualize their network usage patterns. In particular, since a lot of our experiments have distinct Internet and intra-pod communication patterns, it's often useful to be able to investigate where any bottlenecks might be occurring.
iptables mangle rules can be used to arbitrarily mark packets that match particular criteria. Here are our rules to detect whether traffic is internal or internet-bound. The FORWARD rules cover traffic from pods, vs. INPUT and OUTPUT traffic from the host:
# <internal-cidr> stands in for our internal address range
iptables -t mangle -A INPUT ! -s <internal-cidr> -m comment --comment "iptables-exporter openai traffic=internet-in"
iptables -t mangle -A FORWARD ! -s <internal-cidr> -m comment --comment "iptables-exporter openai traffic=internet-in"
iptables -t mangle -A OUTPUT ! -d <internal-cidr> -m comment --comment "iptables-exporter openai traffic=internet-out"
iptables -t mangle -A FORWARD ! -d <internal-cidr> -m comment --comment "iptables-exporter openai traffic=internet-out"

Once marked, iptables will start counters to track the number of bytes and packets that match these rules. You can eyeball these counters by using iptables itself:
% iptables -t mangle -L -v
Chain FORWARD (policy ACCEPT 50M packets, 334G bytes)
 pkts bytes target prot opt in  out source            destination
1253K  555M        all  --  any any anywhere          !<internal-cidr>  /* iptables-exporter openai traffic=internet-out */
1161K 7937M        all  --  any any !<internal-cidr>  anywhere          /* iptables-exporter openai traffic=internet-in */

We use an open-source Prometheus exporter called iptables-exporter to then get these tracked into our monitoring system. This is a simple way to track packets matching a variety of different types of conditions.

One somewhat unique aspect of our network model is that we fully expose the node, pod, and service network CIDR ranges to our researchers. We have a hub-and-spoke network model, and use the native node and pod CIDR ranges to route that traffic. Researchers connect to the hub, and from there have access to any of the individual clusters (the spokes). But the clusters themselves cannot talk to one another. This ensures that clusters remain isolated, with no cross-cluster dependencies that can break failure isolation.
We use a "NAT" host to translate the service network CIDR range for traffic coming from outside the cluster. This setup allows our researchers significant flexibility in choosing how and what kinds of network configurations they use for their experiments.
API Servers
Kubernetes API Servers and etcd are critical components of a healthy working cluster, so we pay special attention to the stress on these systems. We use the Grafana dashboards provided by kube-prometheus, as well as additional in-house dashboards. We've found it useful to alert on the rate of HTTP status 429 (Too Many Requests) and 5xx (Server Error) on the API Servers as a high-level signal of problems.
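As a sketch, an alerting rule along these lines could use the standard `apiserver_request_total` metric; the threshold and names here are illustrative assumptions, not our actual rules:

```yaml
# Hypothetical Prometheus alerting rule on API server 429/5xx rates.
groups:
- name: apiserver-errors
  rules:
  - alert: APIServerErrorRateHigh
    expr: sum(rate(apiserver_request_total{code=~"429|5.."}[5m])) > 10
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Elevated 429/5xx rate on the Kubernetes API servers"
```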

While some folks run API Servers inside kube, we've always run them outside the cluster itself. Both etcd and API servers run on their own dedicated nodes. Our largest clusters run 5 API servers and 5 etcd nodes to spread the load and minimize impact if one were to ever go down. We've had no notable trouble with etcd since splitting out Kubernetes Events into their own etcd cluster back in our last blog post. API Servers are stateless and generally easy to run in a self-healing instance group or scaleset. We haven't yet tried to build any self-healing automation of etcd clusters because incidents have been extremely rare.
API Servers can take up a fair bit of memory, and that tends to scale linearly with the number of nodes in the cluster. For our cluster with 7,500 nodes we observe up to 70GB of heap being used per API Server, so fortunately this should continue to be well within hardware capabilities into the future.

One big strain on API Servers was WATCHes on Endpoints. There are a few services, such as 'kubelet' and 'node-exporter', of which every node in the cluster is a member. When a node was added or removed from the cluster, this WATCH would fire. And because typically each node itself was watching the kubelet service via kube-proxy, the number and bandwidth required in these responses would be $N^2$ and enormous, occasionally 1GB/s or more. EndpointSlices, launched in Kubernetes 1.17, were a huge benefit that brought this load down 1000x.
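A rough back-of-envelope sketch of why that load scales as $N^2$ — the per-endpoint byte size here is a made-up constant for illustration, not a measured value:

```python
# Why Endpoints WATCH traffic scales as O(N^2): every node watches the
# service, and each update carries the full member list of all N nodes.
def endpoints_watch_bytes(nodes: int, bytes_per_endpoint: int = 100) -> int:
    """Total bytes fanned out when one node joins or leaves the service."""
    endpoints_object_size = nodes * bytes_per_endpoint  # full object, N entries
    return nodes * endpoints_object_size                # sent to all N watchers

# Tripling the node count (2,500 -> 7,500) multiplies watch traffic ~9x.
print(endpoints_watch_bytes(7_500) // endpoints_watch_bytes(2_500))
```

EndpointSlices cut this down because an update touches only one bounded slice rather than the whole membership list.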

In general we're very mindful of any API Server requests that scale with the size of the cluster. We try to avoid having any DaemonSets interact with the API Server. In cases where you do need each node to watch for changes, introducing an intermediary caching service, such as the Datadog Cluster Agent, seems to be a good pattern to avoid cluster-wide bottlenecks.
As our clusters have grown, we do less actual autoscaling of our clusters. But we have occasionally run into trouble when autoscaling too much at once. Many requests are generated when a new node joins a cluster, and adding hundreds of nodes at once can overload API server capacity. Smoothing this out, even just by a few seconds, has helped avoid outages.
Time-Collection Metrics with Prometheus and Grafana
We use Prometheus to collect time-series metrics and Grafana for graphs, dashboards, and alerts. We started with a deployment of kube-prometheus, which collects a wide variety of metrics and provides good dashboards for visualization. Over time we've added many of our own dashboards, metrics, and alerts.
As we added more and more nodes, we struggled with the sheer quantity of metrics being collected by Prometheus. While kube-prometheus exposes a lot of useful data, some of it we weren't actually ever looking at, and some was just too granular to collect, store, and query effectively. We use Prometheus rules to "drop" some of these metrics from being ingested.
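Dropping at ingest time is done with relabeling rules in the scrape config; a minimal sketch, where the job name and the choice of metrics to drop are hypothetical:

```yaml
# Hypothetical scrape_config fragment: drop metrics at ingest via relabeling.
scrape_configs:
- job_name: kubelet
  metric_relabel_configs:
  - source_labels: [__name__]
    regex: 'apiserver_request_duration_seconds_bucket|container_memory_failures_total'
    action: drop   # matched series are discarded before storage
```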
For a while we struggled with a problem where Prometheus would consume more and more memory until eventually crashing the container with an Out-Of-Memory (OOM) error. This seemed to occur even after throwing enormous amounts of memory capacity at the application. What's worse, when it did crash, it would take many hours on startup replaying write-ahead-log files before it was usable again.
Eventually we tracked down the source of these OOMs to be an interaction between Grafana and Prometheus, where Grafana would use the /api/v1/series API on Prometheus with a query of {le!=""} (basically, "give me all the histogram metrics"). The implementation of /api/v1/series was unbounded in both time and space—for a query with a lot of results, it would continue to consume ever-more memory and time. It also continues to grow even after the requester has given up and closed the connection. For us, there was never enough memory, and Prometheus would eventually crash. We patched Prometheus to contain this API within a Context to enforce a timeout, which fixed it entirely.
While Prometheus crashed far less often, in times when we did need to restart it, WAL replay remained an issue. It would often take many hours to replay through all the WAL logs before Prometheus was up collecting new metrics and servicing queries. With help from Robust Perception, we found that applying GOMAXPROCS=24 made a big improvement. Prometheus tries to use all cores during WAL replay, and for servers with a large number of cores, the contention kills all performance.
We're exploring new options to increase our monitoring capacity, described in the "Unsolved problems" section below.
Healthchecks
With a cluster this large, we of course rely on automation to detect and remove misbehaving nodes from the cluster. Over time we have built up a number of healthcheck systems.
Passive Healthchecks
Some healthchecks are passive, always running on all nodes. These monitor basic system resources such as network reachability, bad or full disks, or GPU errors. GPUs exhibit problems in a number of different ways, but an easy common one is an "Uncorrectable ECC error." Nvidia's Data Center GPU Manager (DCGM) tools make it easy to query for this and a number of other "Xid" errors. One way we track these errors is via dcgm-exporter, which ingests the metrics into Prometheus, our monitoring system. This appears as the DCGM_FI_DEV_XID_ERRORS metric, set to the error code that has most recently occurred. Additionally, the NVML Device Query API exposes more detailed information about the health and operation of a GPU.
Once we detect an error, it can often be fixed by resetting the GPU or system, though in some cases it does lead to the underlying GPU needing to be physically replaced.
Another form of healthcheck tracks maintenance events from the upstream cloud provider. Each of the major cloud providers exposes a way to know if the current VM is due for an upcoming maintenance event that will eventually cause a disruption. The VM may need to be rebooted so an underlying hypervisor patch can be applied, or the physical node swapped out for other hardware.
These passive healthchecks run constantly in the background on all nodes. If a healthcheck starts failing, the node is automatically cordoned so no new pods are scheduled on it. For more serious healthcheck failures, we will also attempt a pod eviction to request that all currently-running pods exit immediately. It's still up to the pod itself, configurable via a Pod Disruption Budget, to decide whether it wants to allow this eviction to occur. Eventually, either after all pods have terminated or after 7 days have elapsed (part of our SLA), we will forcibly terminate the VM.
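A Pod Disruption Budget that refuses voluntary evictions while a job runs might look like the following sketch; the names and label selector are hypothetical:

```yaml
# Hypothetical PDB: disallow voluntary evictions for a running experiment.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: experiment-pdb
spec:
  maxUnavailable: 0        # eviction requests are refused while pods are needed
  selector:
    matchLabels:
      job-name: experiment-1234
```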
Active GPU tests
Unfortunately, not all GPU problems manifest as error codes visible via DCGM. We've built up our own library of tests that exercise GPUs to catch additional problems and ensure that the hardware and driver are behaving as expected. These tests can't be run in the background—they require exclusive use of a GPU for several seconds or minutes to run.
We first run these tests on nodes upon boot, in a system we call "preflight." All nodes join the cluster with a "preflight" taint and label applied. This taint prevents normal pods from being scheduled on the node. A DaemonSet is configured to run preflight test pods on all nodes with this label. Upon successful completion of the test, the test itself removes the taint and label, and the node is then available for general use.
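The taint-and-toleration pair might be sketched as follows; the key name `preflight` and node name are assumptions for illustration, not our actual configuration:

```yaml
# Hypothetical taint and label on a freshly booted node.
apiVersion: v1
kind: Node
metadata:
  name: node-001
  labels:
    preflight: pending     # the DaemonSet's nodeSelector matches this label
spec:
  taints:
  - key: preflight
    value: pending
    effect: NoSchedule     # keeps normal pods off until tests pass
---
# Pod-spec fragment for the preflight test DaemonSet: tolerate the taint.
tolerations:
- key: preflight
  operator: Exists
  effect: NoSchedule
```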
We also run these tests periodically during the lifetime of a node. We run this as a CronJob, allowing it to land on any available node in the cluster. This is admittedly a bit random and uncontrolled about which nodes get tested, but we've found that over time it provides sufficient coverage with minimal coordination or disruption.
Quotas & Useful resource Utilization
As we scaled up our clusters, researchers started to find themselves having difficulty getting all of the capacity they were allocated. Traditional job scheduling systems have a lot of different features available to fairly run work between competing teams, which Kubernetes does not have. Over time, we took inspiration from those job scheduling systems and built several capabilities in a Kubernetes-native way.
Team taints
We have a service in each cluster, "team-resource-manager," that has several functions. Its data source is a ConfigMap that specifies tuples of (node selector, team label to apply, allocation amount) for all of the research teams that have capacity in a given cluster. It reconciles this with the current nodes in the cluster, tainting the appropriate number of nodes with the corresponding team taint.
"team-resource-manager" also has an admission webhook service, such that as each job is submitted, a corresponding toleration is applied based on the submitter's team membership. Using taints allows us to constrain the Kubernetes pod scheduler flexibly, such as allowing an "any" toleration for lower-priority pods, which lets teams borrow one another's capacity without requiring heavyweight coordination.
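A sketch of what such a webhook might inject into a pod spec; the taint key `team` and the team name are hypothetical stand-ins:

```yaml
# Hypothetical toleration injected by the admission webhook:
# the normal case tolerates only the submitter's own team taint.
tolerations:
- key: team
  operator: Equal
  value: team-alpha
  effect: NoSchedule
# For low-priority pods, an "any" toleration would instead use:
#   operator: Exists (no value), letting them borrow idle capacity team-wide.
```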
CPU & GPU Balloons
In addition to using cluster-autoscaler to dynamically scale our VM-backed clusters, we use it to remediate (remove & re-add) unhealthy members within the cluster. We do this by setting the "min size" of the cluster to zero, and the "max size" of the cluster to the capacity available. However, if cluster-autoscaler sees idle nodes, it will attempt to scale down to only the needed capacity. For multiple reasons (VM spin-up latency, pre-allocated costs, the API server impacts mentioned above) this idle-scaling isn't ideal.
So, we introduced a balloon Deployment for both our CPU-only and GPU hosts. This Deployment contains a ReplicaSet with a "max size" number of low-priority pods. These pods occupy resources within a node, so the autoscaler doesn't consider them idle. However, since they're low priority, the scheduler can evict them immediately to make room for actual work. (We chose to use a Deployment instead of a DaemonSet, to avoid the DaemonSet being considered idle workload on a node.)
One thing of note: we use pod anti-affinity to ensure the pods distribute evenly across the nodes. Earlier versions of the Kubernetes scheduler had an $O(N^2)$ performance issue with pod anti-affinity. This has been corrected since Kubernetes 1.18.
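Putting those pieces together, a balloon Deployment might be sketched like this; the priority class, replica count, and labels are illustrative assumptions:

```yaml
# Hypothetical GPU balloon Deployment; preempted immediately by real work.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-balloon
spec:
  replicas: 7500                 # the cluster's "max size"
  selector:
    matchLabels:
      app: gpu-balloon
  template:
    metadata:
      labels:
        app: gpu-balloon
    spec:
      priorityClassName: balloon # assumed low-priority PriorityClass
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
        resources:
          limits:
            nvidia.com/gpu: 8    # hold a full node's worth of GPUs
      affinity:
        podAntiAffinity:         # spread one balloon pod per node
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: gpu-balloon
            topologyKey: kubernetes.io/hostname
```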
Gang Scheduling
Our experiments often involve one or more StatefulSets, each running a different portion of the training effort. For optimizers, researchers need all members of the StatefulSet to be scheduled before any training can be done (as we often use MPI to coordinate between optimizer members, and MPI is sensitive to group membership changes).
However, Kubernetes by default won't necessarily prioritize fulfilling all requests from one StatefulSet over another. For example, if two experiments each requested 100% of the cluster's capacity, instead of scheduling all of one experiment or the other, Kubernetes might schedule only half of each experiment's pods, leading to a deadlock where neither experiment can make progress.
We tried a few things that needed a custom scheduler, but ran into edge cases that caused conflicts with how normal pods were scheduled. Kubernetes 1.18 introduced a plugin architecture for the core Kubernetes scheduler, making it much easier to add features like this natively. We recently landed on the Coscheduling plugin as a good way to solve this problem.
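With the coscheduling plugin from the scheduler-plugins project, gang semantics are expressed via a PodGroup; a sketch follows, where the group name and member count are hypothetical and the API version may differ between plugin releases:

```yaml
# Hypothetical PodGroup: schedule no pod until all 128 members can be placed.
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: experiment-1234
spec:
  minMember: 128
---
# Pod-template fragment: pods opt in by referencing the group via this label.
metadata:
  labels:
    scheduling.x-k8s.io/pod-group: experiment-1234
```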
Unsolved Issues
There are a number of problems still to address as we scale up our Kubernetes clusters. A few of them include:
Metrics
At our scale we've had many difficulties with Prometheus's built-in TSDB storage engine being slow to compact, and with the long times needed to replay the WAL (Write-Ahead Log) any time it restarts. Queries also tend to result in "query processing would load too many samples" errors. We're in the process of migrating to a different Prometheus-compatible storage and query engine. Look forward to a future blog post about how it goes!
Pod Community Site visitors Shaping
As we scale up our clusters, each pod is calculated to have a certain amount of Internet bandwidth available. The aggregate Internet bandwidth requirements per person have become substantial, and our researchers now have the ability to unintentionally put a significant resource strain on other locations on the Internet, such as sites hosting datasets for download and software packages to install.
Conclusion
We've found Kubernetes to be an exceptionally flexible platform for our research needs. It has the ability to scale up to meet the most demanding workloads we've put on it. There are many areas yet where it needs improvement, though, and the Supercomputing team at OpenAI will continue to explore how Kubernetes can scale. If this kind of work seems interesting, you should consider applying at OpenAI!



