Kubernetes Right-sizing Basics

Background
Kubernetes cluster right-sizing
Kubernetes workload right-sizing
Third-party scaling solutions
Issues with right-sizing Kubernetes workloads from within Kubernetes
What’s a cluster administrator to do?
Additional Reading

In the earlier blog posts in this series, we discussed how Kubernetes resource management works. We looked at Kubernetes requests and limits, their advanced behavior, and their impact on pods. Right-sizing Kubernetes workloads means finding the balance where CPU and memory requests are large enough not to compromise the application SLA, but not so large that they waste resources and send costs skyrocketing. Kubernetes provides several automated mechanisms to handle right-sizing, and there are open source projects and commercial products available for cluster and pod right-sizing.

In this blog post, we’ll cover how Kubernetes supports cluster and pod right-sizing. We will also identify some of the issues with the in-cluster, controller-based model of pod right-sizing that currently dominates the field of Kubernetes FinOps.

Background

Since the beginning, Kubernetes has supported horizontal scaling of both clusters (by adding or removing nodes) and workloads (by changing the number of replicas of a pod to be maintained by controller automation). The desire to automate these changes resulted in the creation of the earliest autoscaling component, the HorizontalPodAutoscaler, which first appeared in Kubernetes 1.1. The Cluster Autoscaler add-on showed up not much later, in Kubernetes 1.3. These continue to be the main mechanisms used to scale pods and clusters. More recently, the need to automatically alter the actual definitions of pods themselves brought the VerticalPodAutoscaler, and a variety of third-party tools such as Karpenter, KEDA, and a host of others have appeared alongside it.

Kubernetes cluster right-sizing

Horizontally scaling clusters

Horizontally scaling clusters is easy: a new node is provisioned and added to the cluster, or a node is cordoned, drained, removed from the cluster, and terminated. In the past this was automated by using mechanisms like an AWS Auto Scaling group, configured with a boot-time configuration that would handle joining the node to the cluster. This pattern remains even with newer tools like the Cluster Autoscaler and those from third-party projects and vendors. Note, though, that while scale-out is generally non-disruptive, scaling in a cluster to reduce fragmentation and optimize Kubernetes costs usually involves pods being terminated, because Kubernetes does not support live migration of pods from one worker node to another.

Vertically scaling clusters

Vertical scaling is easy in theory too, if your infrastructure supports adding to or removing from a particular resource on a node without restarting it. Kubernetes will just start recognizing the new amount of that resource. However, vertical scale-down of nodes can be challenging if the reason you want to scale down is low resource utilization, because those unconsumed resources may actually be allocated by Kubernetes as part of container requests. Also, many infrastructure providers don’t support any kind of live resizing, so changing node resources will require a shutdown and restart at minimum – possibly a full node replacement.
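To make the scale-down caveat concrete, here is a minimal sketch of the arithmetic involved. The NodeCpu type, its field names, and the example numbers are illustrative assumptions, not a Kubernetes API; the point is only that low observed utilization does not mean the capacity is free to remove when it is already reserved by container requests.

```python
from dataclasses import dataclass


@dataclass
class NodeCpu:
    """CPU figures for one node, in millicores (hypothetical type for illustration)."""
    allocatable: int  # CPU the node offers to pods
    requested: int    # sum of CPU requests of the pods scheduled on the node
    used: int         # CPU actually consumed, as observed by monitoring


def utilization(node: NodeCpu) -> float:
    """Fraction of allocatable CPU that is actually being used."""
    return node.used / node.allocatable


def removable_without_eviction(node: NodeCpu) -> int:
    """CPU that could be taken away without invalidating any existing request."""
    return max(0, node.allocatable - node.requested)


# Assumed example: a node that looks 75% idle but is almost fully reserved.
node = NodeCpu(allocatable=8000, requested=7000, used=2000)
print(f"utilization: {utilization(node):.0%}")
print(f"CPU removable without evicting pods: {removable_without_eviction(node)}m")
```

In this assumed example the node is only a quarter utilized, yet just 1000m of its 8000m could be removed before the scheduler's bookkeeping would no longer fit the pods already placed there.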

Kubernetes workload right-sizing

Kubernetes has mechanisms to scale workloads both horizontally and vertically, though not all with the same level of support or ease of use.

Horizontal pod scaling

Horizontal pod scaling is natively supported in Kubernetes through a resource called a HorizontalPodAutoscaler (HPA), which targets a scalable workload resource such as a Deployment. HPA works by changing the number of replicas the target maintains in response to changes in metrics like CPU usage, or optionally in response to custom metrics (more on custom metrics later).
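The replica calculation HPA performs can be sketched in a few lines. This follows the formula given in the Kubernetes documentation (desired replicas = ceil(current replicas × current metric / target metric), with no change when the ratio is within a tolerance that defaults to 0.1). The function and parameter names are our own, and the sketch ignores details such as min/max replica bounds, stabilization windows, and pods that are not yet ready.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     tolerance: float = 0.1) -> int:
    """Simplified version of the documented HPA scaling formula."""
    ratio = current_metric / target_metric
    # HPA skips scaling when the ratio is close enough to 1.0.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# Assumed example: 4 replicas averaging 90% CPU utilization against a 60% target.
print(desired_replicas(4, current_metric=90, target_metric=60))
```

Note that the metric is a ratio against the target, so for CPU utilization the result depends on the pod's CPU request as much as on its actual usage.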

Vertical workload scaling or pod right-sizing

There are two methods of vertical pod scaling supported by Kubernetes, though only one is stable:

Resizing by restarting pods

In-place resizing of pods

Restarting pods is how Kubernetes has handled most updates to a pod spec since the beginning, because some aspects of a pod are defined to be immutable. Changes must be implemented by terminating an old pod and creating a new one to replace it. Resource requests are one such aspect — but this means that if an administrator or Kubernetes controller adjusts resource requests up or down, applications incur pod disruption, and some applications don’t tolerate this well.

Starting with Kubernetes 1.27, a feature was added for in-place resizing of container resources in a pod. However, you currently have to enable the InPlacePodVerticalScaling feature gate to use it, typically at cluster install time, because the feature is still alpha as of this writing (Kubernetes 1.29). It also has the limitation that a resize can’t change the QoS class of a pod.

See the second blog post in this series for details on pod QoS classes.
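As a rough illustration of the QoS limitation, the sketch below classifies a pod using the standard QoS rules and checks whether a proposed resize would keep the same class. It is a simplification under stated assumptions: containers are plain dictionaries, quantities are already normalized to integers (millicores and bytes), and the function names are hypothetical rather than part of any Kubernetes client library.

```python
from typing import Dict, List

Container = Dict[str, Dict[str, int]]  # {"requests": {...}, "limits": {...}}


def qos_class(containers: List[Container]) -> str:
    """Classify a pod as Guaranteed, Burstable, or BestEffort."""
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # When a request is omitted, it defaults to the limit.
            request = requests.get(resource, limit)
            if limit is not None or request is not None:
                any_set = True
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


def resize_keeps_qos(before: List[Container], after: List[Container]) -> bool:
    """True if the resized pod would stay in the same QoS class."""
    return qos_class(before) == qos_class(after)


# Assumed example: a Guaranteed pod with 500m CPU and 256 MiB of memory.
pod = [{"requests": {"cpu": 500, "memory": 268435456},
        "limits": {"cpu": 500, "memory": 268435456}}]
# Raising request and limit together keeps the class ...
grown = [{"requests": {"cpu": 1000, "memory": 268435456},
          "limits": {"cpu": 1000, "memory": 268435456}}]
# ... while raising only the limit would turn the pod Burstable.
limit_only = [{"requests": {"cpu": 500, "memory": 268435456},
               "limits": {"cpu": 1000, "memory": 268435456}}]
print(resize_keeps_qos(pod, grown), resize_keeps_qos(pod, limit_only))
```

In practice this means a Guaranteed pod has to have its requests and limits moved in lockstep when it is resized in place.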

To automate these methods of resizing pods, Kubernetes provides the Vertical Pod Autoscaler (VPA), an add-on that targets workload controllers such as Deployments for resizing.

Currently, VPA only supports restart-based resizing. Adding support for the new in-place resizing ability is in the proposal stage.

Because the HPA and VPA are separate components with no awareness of each other, they can cause undesirable results if they both target the same resource for autoscaling. A proposal has been made to combine the capabilities of both into a “multidimensional pod autoscaler” that would handle both types in an integrated way, so be aware that how pod autoscaling is defined and implemented could change significantly in the future.

Third-party scaling solutions

A growing number of projects and products provide solutions for workload or cluster scaling, such as KEDA, Karpenter, Goldilocks, and others. In general, these fall into one of two categories:

A controller runs in the cluster and makes changes or recommendations
An agent runs in the cluster, gathering data and reporting to an external SaaS console that makes changes or recommendations

Issues with right-sizing Kubernetes workloads from within Kubernetes

The whole idea of automating scaling is a complex one, but more importantly, there are some inherent concerns with the dominant model of controlling scaling changes using tools run inside the cluster itself, regardless of which specific tools are in use:

If the in-cluster components are directly interacting with authenticated APIs of external services like AWS EC2, they need to safely handle a credential for that API, which means you also need to provide it securely. This is far from impossible to do in a Kubernetes environment, but many “solutions” to that issue merely add a layer of abstraction (for example, having the API credential held by a cluster-external service and provided on-demand moves the problem around, but doesn’t actually solve it because now you have to securely authenticate your workload to that external service). Infrastructure credential exfiltration or inadvertent leakage can cost you many times what your cluster itself did.
Having a cluster management component or agent deployed in the cluster it’s intended to manage makes the component or agent itself a Kubernetes workload – which means it can now be affected by things like an over-utilized cluster, node failures, networking issues, etc. These issues can be mitigated by careful planning, but will always be present to some degree.
Poorly-defined autoscaling policies or interactions between autoscalers can lead to cost inflation rather than reduction – for example, if vertical workload autoscaling causes a pod’s resources to be reduced to a point where a horizontal scaler activates based on “high pod CPU utilization”, additional nodes could be provisioned to handle a phantom “load spike” when the actual application load hasn’t changed at all.
Some operations are inherently disruptive: there is no way to move a pod from one node to another without terminating it – and not all workloads operate well with arbitrary disruption like that; this is in fact why Kubernetes introduced the concept of PodDisruptionBudget. As a result, scalers can get caught on the horns of a dilemma: it’s not possible to increase efficiency in some desired way without terminating a pod, but that application’s PodDisruptionBudget requires that pod not be terminated.
Some tools also suffer (through no fault of their own, really) from unrealistic expectations: a tool that is only a cluster autoscaler will never make your workloads more efficient; it can only make sure that your cluster has the right amount of resources to run all your inefficiently configured workloads. It’s up to a cluster administrator, of course, to make sure the tool they adopt for a particular job is in fact intended for and capable of doing it.
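The autoscaler interaction described in the list above, where a vertical resize triggers a horizontal scale-out, is easy to reproduce on paper. The sketch below assumes an HPA that targets 60% CPU utilization measured against the pod's request, reuses the documented HPA formula with its default tolerance of 0.1, and uses made-up numbers; the function name is hypothetical.

```python
import math


def hpa_replicas(replicas: int, usage_m: int, request_m: int,
                 target_utilization: float, tolerance: float = 0.1) -> int:
    """Replica count an HPA would ask for, given per-pod usage and request in millicores."""
    utilization = usage_m / request_m  # utilization is measured against the request
    ratio = utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        return replicas
    return math.ceil(replicas * ratio)


# Application load is constant: each of 3 pods uses 600m of CPU throughout.
before = hpa_replicas(3, usage_m=600, request_m=1000, target_utilization=0.60)
# A vertical autoscaler trims the request from 1000m to 700m to reduce waste.
after = hpa_replicas(3, usage_m=600, request_m=700, target_utilization=0.60)
print(before, after)
```

With these assumed numbers the HPA is content with 3 replicas before the resize and asks for 5 afterwards, even though the application is doing exactly the same amount of work. If those extra pods don't fit on existing nodes, a cluster autoscaler will then add capacity to host them.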

In general, all these tools build on the capabilities, and inherit the limitations, of the underlying infrastructure and the Kubernetes architecture itself. Essentially, none of them is doing anything a cluster admin couldn’t do with kubectl and infrastructure provisioning tools; they just do it a lot faster due to automation. That doesn’t mean you shouldn’t use them, but it does mean you should carefully consider your needs and which of them are fulfilled or impacted by a given tool.

However, there are limits to how far resource efficiency can be optimized with this approach without giving up reliability. We see this routinely demonstrated in data: despite the plethora of cluster and workload scaling utilities and their increasing adoption, typical cluster utilization in one survey after another remains shockingly low, generally 30% at most – often much less; that unused capacity translates directly into wasted budget. It’s like trying to ship small goods packaged into individual bags or boxes and loading those directly onto ships, planes and trains, repackaging them if necessary first – it can work in theory but past a certain point, it doesn’t scale in the real world and you need a solution that lets you consolidate things and manage them as a whole.

What’s a cluster administrator to do?

In addition to in-cluster management of individual workloads, you need utilization management operating at a layer of abstraction below that of the cluster infrastructure – and at Platform9 we have just the thing to meet that need in Elastic Machine Pool. In our next blog post, we’ll dive into the details of how EMP uses production-proven technology for lower-level resource management to achieve additional cost improvements in harmony with your existing tools for in-cluster FinOps automation.