Kubernetes is inherently scalable. It has a number of tools that allow both applications and the infrastructure they are hosted on to scale in and out based on demand, efficiency and a number of other metrics.
In this article we will take a deep dive into these autoscaling tools and identify best practices for working with them. This list of best practices is targeted towards developers, DevOps and Kubernetes administrators tasked with application development and delivery on Kubernetes as well as managing and operating these applications once they are in production.
Kubernetes has three scalability tools. Two of these, the Horizontal Pod Autoscaler (HPA) and the Vertical Pod Autoscaler (VPA), function on the application abstraction layer. The cluster autoscaler works on the infrastructure layer.
In this article we will outline best practices for all three autoscaling tools. Let’s start with the cluster autoscaler.
The cluster autoscaler is a Kubernetes tool that increases or decreases the size of a Kubernetes cluster (by adding or removing nodes), based on the presence of pending pods and node utilization metrics.
The cluster autoscaler is a great tool to ensure that the underlying cluster infrastructure is elastic and scalable and can meet the changing demands of the workloads on top.
Let's now move on to the cluster autoscaler best practices.
Kubernetes is a fast-moving platform, with new releases and features shipping regularly. A best practice when deploying the cluster autoscaler is to ensure that you use it with the recommended Kubernetes version. The cluster autoscaler project publishes a complete list of which cluster autoscaler versions are compatible with which Kubernetes versions.
Ensure Cluster Nodes have the Same Capacity
The cluster autoscaler only functions correctly with Kubernetes node groups/instance groups whose nodes have the same capacity. One reason for this is that the cluster autoscaler assumes each individual node in a node group has the same CPU and memory capacity. Based on this assumption, it creates a template node for each node group and makes autoscaling decisions based on that template node.
A best practice, therefore, is to ensure that the instance group being autoscaled via the cluster autoscaler has instances/nodes of the same type. For public cloud providers like AWS, this might not be optimal, since diversification and availability considerations dictate the use of multiple instance types. The cluster autoscaler does support node groups with mixed instance types. A best practice, however, is to ensure that these instance types have the same resource footprint, i.e. the same amount of CPU and memory resources.
Ensure Every Pod has Resource Requests Defined
Since the cluster autoscaler makes scaling decisions based on the scheduling status of pods and the utilization of individual nodes, specifying resource requests is essential for it to function correctly.
Take cluster scale-down. The cluster autoscaler will scale down any nodes that have a utilization less than a specified threshold. Utilization is calculated as the sum of requested resources divided by the capacity. Utilization calculations could be thrown off by the presence of any pods or containers without resource requests and could lead to suboptimal functioning.
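To make the calculation concrete: on a node with 4 CPUs of capacity whose pods request 1 CPU in total, CPU utilization computed this way is 25%. A container with no request adds nothing to that sum, however busy it actually is. The manifest below is a minimal sketch of a pod in which every container declares requests; the pod name, images and values are placeholders chosen for illustration, not recommendations.

```yaml
# Hypothetical pod: names, images and request values are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  containers:
  - name: app
    image: nginx:1.25
    resources:
      requests:
        cpu: 250m        # counted towards the node's requested CPU
        memory: 256Mi    # counted towards the node's requested memory
  - name: sidecar
    image: busybox:1.36
    command: ["sleep", "3600"]
    resources:
      requests:          # sidecars need requests too, or the sum is understated
        cpu: 50m
        memory: 32Mi
```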
A best practice, therefore, is to ensure that all pods scheduled to run in an autoscaled node group/instance group have resource requests specified.
Specify PodDisruptionBudget for kube-system Pods
Kube-system pods by default prevent the cluster autoscaler from scaling down the nodes they are running on. In situations where these pods end up on different nodes, they can also prevent a cluster from scaling down.
To avoid situations where nodes cannot be scaled down due to the presence of system pods, a best practice is to specify a pod disruption budget for these pods. Pod disruption budgets allow Kubernetes administrators to avoid disruptions to critical pods and ensure that a desired number of these pods is always running.
While specifying a disruption budget for system pods, it is important to consider the number of replicas of these pods that are provisioned by default.
Kube-dns is the only system pod that has multiple running replicas by default. Most other system pods run as single instance pods and restarting them could result in disruptions to the cluster.
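As a sketch of what such a budget could look like for kube-dns, the manifest below allows one DNS replica at a time to be evicted. It assumes the DNS pods carry the k8s-app: kube-dns label and that the cluster serves the policy/v1 API (older clusters use policy/v1beta1); the object name is made up, so check both assumptions against your own cluster before applying it.

```yaml
# Assumed label and API version; verify with
#   kubectl -n kube-system get pods --show-labels
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kube-dns-pdb          # hypothetical name
  namespace: kube-system
spec:
  maxUnavailable: 1           # at most one DNS replica may be down during a drain
  selector:
    matchLabels:
      k8s-app: kube-dns
```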
A best practice in this context is to avoid setting a disruption budget for single-instance pods like the metrics-server.
Specify PodDisruptionBudget for Application Pods
In addition to specifying a pod disruption budget for system pods, another best practice is to also specify a pod disruption budget for application pods. This ensures that the cluster autoscaler does not reduce the number of running pod replicas below a certain minimum, which protects critical applications from disruptions and ensures high availability.
Pod disruption budgets can be specified using the .spec.minAvailable and .spec.maxUnavailable fields. .spec.minAvailable specifies the number of pods that must still be available after the eviction, as an absolute number or a percentage. Similarly, .spec.maxUnavailable sets out the maximum number of pods that can be unavailable after the eviction, expressed either as an absolute number or a percentage.
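A minimal sketch of an application budget follows. The app: web label, the object name and the value of 2 are hypothetical; a single budget can set either minAvailable or maxUnavailable, but not both.

```yaml
# Hypothetical application PDB; label, name and value are placeholders.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2             # absolute number; a percentage such as "50%" also works
  # maxUnavailable: 1         # alternative to minAvailable, never both in one budget
  selector:
    matchLabels:
      app: web
```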
For the cluster autoscaler to remain responsive, it is important to ensure that the cluster does not exceed a certain size. The official scalability and responsiveness service level for the cluster autoscaler is set at 1000 nodes, with each node running 30 pods. The cluster autoscaler project has published a complete writeup of the scale-up and scale-down results from a test setup with a 1000-node cluster.
A best practice, therefore, is to avoid cluster sprawl and ensure that the cluster footprint does not exceed the specified scalability limit.
Ensure Resource Availability for the Cluster Autoscaler Pod
For larger clusters it is important to ensure resource availability for the cluster autoscaler. A best practice in this context is to set resource requests of the cluster autoscaler pod to a minimum of 1 CPU.
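In manifest terms, this could look like the fragment below, which would sit inside the cluster autoscaler Deployment. The container name is an assumption about how your deployment is laid out, and the memory figure is a placeholder; only the 1 CPU request comes from the recommendation above.

```yaml
# Fragment of a cluster autoscaler Deployment (container name assumed).
spec:
  template:
    spec:
      containers:
      - name: cluster-autoscaler
        resources:
          requests:
            cpu: "1"          # minimum suggested for larger clusters
            memory: 600Mi     # placeholder; size this from observed usage
```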
It is also important to ensure that the node the cluster autoscaler pod is running on has enough resources available to support it. Running the cluster autoscaler pod on a node under resource pressure could lead to degraded performance or to the cluster autoscaler becoming unresponsive.
Ensure Resource Requests are Close to Actual Usage
As mentioned before, the cluster autoscaler makes scaling decisions based on the presence of pending pods and the utilization of individual nodes. Node utilization is calculated as the sum of requested resources of all pods divided by the capacity.
However, most developers tend to over-provision resource requests. This can lead to situations where pods use only a fraction of the resources they request, so the node's real utilization is low. Since the total resource requests are high, though, the cluster autoscaler calculates a higher utilization level for the node and might not scale it down.
A best practice, therefore, is to ensure that pods' requested resources are comparable to their actual resource usage. Using the Vertical Pod Autoscaler (VPA) is a good starting point. These decisions can also be based on the historical resource usage of pods.
Over-provision Cluster to Ensure Headroom for Critical Pods
The cluster autoscaler has a service level objective (SLO) of 30 seconds of latency between the time a pod is marked as unschedulable and the time that it requests a scale-up from the cloud provider. This latency benchmark is for smaller clusters of fewer than 100 nodes. For larger clusters of up to 1000 nodes, this latency is expected to be around the 60 second mark.
The actual time it takes for the pod to be scheduled, once the scale-up has been requested and a new node is being provisioned, depends on the cloud provider. This could very well mean a delay of several minutes.
To avoid this delay and ensure that pods spend as little time as possible in an unschedulable state, a best practice is to over-provision the cluster. This can be accomplished using a deployment running pause pods.
Pause pods are dummy pods that are spun up exclusively to reserve space for other, higher priority pods. Since pause pods are assigned a very low priority, the Kubernetes scheduler will evict them to make space for unscheduled pods with a higher priority. This essentially means that critical pods do not have to wait for a new node to be provisioned by the cloud provider and can be quickly scheduled on the already existing nodes, replacing the pause pods.
Once the evicted pause pods are re-created, they become unschedulable, which results in the cluster scaling up. The amount of over-provisioned headroom can be controlled by specifying the size of the pause pods, i.e. their resource requests.
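The setup described above can be sketched with a low-priority PriorityClass and a Deployment of pause containers. The names, replica count, request sizes and pause image tag are all assumptions for illustration. The priority of -1 is chosen on the understanding that the cluster autoscaler ignores pods below its expendable priority cutoff (-10 by default), so the pause pods need to stay above that cutoff to trigger a scale-up.

```yaml
# Hypothetical over-provisioning setup; all names and sizes are placeholders.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -1                     # lower than the default pod priority of 0
globalDefault: false
description: "Priority class for placeholder pause pods."
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
spec:
  replicas: 2                 # headroom = replicas x requests below
  selector:
    matchLabels:
      run: overprovisioning
  template:
    metadata:
      labels:
        run: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
      - name: reserve-resources
        image: registry.k8s.io/pause:3.9   # image tag is an assumption
        resources:
          requests:
            cpu: "1"          # each pause pod reserves this much CPU
            memory: 1Gi       # and this much memory
```

With these values the cluster keeps roughly 2 CPUs and 2Gi of memory free for higher priority pods at any time.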
Let us now move on to the horizontal pod autoscaler (HPA).
HPA scales the number of pods in a replication controller, deployment, replica set or stateful set based on CPU utilization. HPA can also be configured to make scaling decisions based on custom or external metrics.
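A minimal sketch of a CPU-based HPA is shown below. It assumes a Deployment named web exists and that the cluster serves the autoscaling/v2 API (older clusters expose equivalent fields under autoscaling/v2beta2); the replica bounds and the 60% target are placeholders.

```yaml
# Hypothetical HPA; target name, bounds and threshold are placeholders.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60   # percent of the pods' CPU requests
```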
HPA is a great tool to ensure that critical applications are elastic and can scale out to meet increasing demand as well as scale in to ensure optimal resource usage.
Ensure all Pods have Resource Requests Configured
HPA makes scaling decisions based on the observed CPU utilization values of pods that are part of a Kubernetes controller. Utilization values are calculated as a percentage of the resource requests of individual pods. Missing resource request values for some containers might throw off the utilization calculations of the HPA controller, leading to suboptimal operation and scaling decisions.
A best practice, therefore, is to ensure that resource request values are configured for all containers of each individual pod that is part of the Kubernetes controller being scaled using HPA.
Install metrics-server
HPA makes scaling decisions based on per-pod resource metrics retrieved from the resource metrics API (metrics.k8s.io). The metrics.k8s.io API is provided by the metrics-server. A best practice therefore is to launch metrics-server in your Kubernetes cluster as a cluster add-on.
In addition to this, another best practice is to leave the kube-controller-manager flag that controls whether HPA uses the REST metrics clients set to true or unset. This is important since setting this flag to false will revert to Heapster, which is deprecated as of Kubernetes 1.11.
The HPA can also make scaling decisions based on custom or external metrics. There are two types of custom metrics supported: pod and object metrics. Pod metrics are averaged across all pods and as such only support a target type of AverageValue. Object metrics can describe any other object in the same namespace and support more than one target type. A best practice when configuring custom metrics is to ensure that the correct target type is used for pod and object metrics.
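The sketch below shows how the two kinds of custom metric might appear in one autoscaling/v2 HPA. The metric names, the Ingress object and the target values are hypothetical, and a custom metrics adapter must be installed for these metrics to resolve. The pod metric uses an AverageValue target, while the object metric here uses a Value target, which to my understanding is one of the target types the v2 API accepts for object metrics.

```yaml
# Hypothetical metrics; requires a custom metrics adapter in the cluster.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-custom
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods                  # averaged across all pods of the target
    pods:
      metric:
        name: packets-per-second
      target:
        type: AverageValue
        averageValue: 1k
  - type: Object                # describes another object in the namespace
    object:
      metric:
        name: requests-per-second
      describedObject:
        apiVersion: networking.k8s.io/v1
        kind: Ingress
        name: main-route
      target:
        type: Value
        value: 10k
```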
External metrics allow HPA to autoscale applications based on metrics provided by third-party monitoring systems. External metrics also support more than one target type.
A best practice when deciding between custom and external metrics (when such a choice is possible) is to prefer custom metrics. One reason for this is that the external metrics API takes a lot more effort to secure than the custom metrics API and could potentially allow access to all metrics.
The dynamic nature of the metrics being evaluated by the HPA may at times lead to scaling events in quick succession without a period between those scaling events. This leads to thrashing where the number of replicas fluctuates frequently and is not desirable.
To get around this and specify a cool-down period, a best practice is to configure the --horizontal-pod-autoscaler-downscale-stabilization flag passed to the kube-controller-manager. This flag has a default value of 5 minutes and specifies the duration HPA waits after a downscale event before initiating another downscale operation.
Kubernetes admins should also take into account the unique requirements of their applications when deciding on an optimal value for this duration.
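Where the controller manager flags cannot be changed, or where one application needs a different cool-down than the rest of the cluster, the autoscaling/v2 API also offers a per-HPA behavior field. This field is not covered in the discussion above, so treat the sketch below as a related option under the assumption that your cluster serves autoscaling/v2; the names and the 600 second window are placeholders.

```yaml
# Hypothetical per-HPA cool-down; names and values are placeholders.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 600   # consider the last 10 minutes before scaling in
```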
By default, the HPA tolerates a 10% change in the desired-to-actual metrics ratio before scaling. Depending on application requirements, this value can be changed by configuring the --horizontal-pod-autoscaler-tolerance flag. Other configurable flags include --horizontal-pod-autoscaler-initial-readiness-delay and --horizontal-pod-autoscaler-sync-period. All of these can be configured based on unique cluster or application requirements.
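On clusters where you control the control plane, these flags are typically set on the kube-controller-manager command line. The fragment below is a sketch that assumes the controller manager runs as a static pod, as it does on kubeadm-style clusters; managed Kubernetes services generally do not expose these flags. The values shown are the defaults as I understand them, so change only the ones you have a reason to change.

```yaml
# Fragment of a kube-controller-manager static pod manifest (layout assumed).
spec:
  containers:
  - name: kube-controller-manager
    command:
    - kube-controller-manager
    - --horizontal-pod-autoscaler-downscale-stabilization=5m0s
    - --horizontal-pod-autoscaler-tolerance=0.1
    - --horizontal-pod-autoscaler-initial-readiness-delay=30s
    - --horizontal-pod-autoscaler-sync-period=15s
```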
Next we will review best practices for the Vertical Pod Autoscaler (VPA). VPA automatically sets the resource request and limit values of containers based on usage. VPA aims to reduce the maintenance overhead of configuring resource requests and limits for containers and improve the utilization of cluster resources.
Version 0.4 and later of the VerticalPodAutoscaler requires custom resource definition capabilities and can therefore NOT be used with Kubernetes versions older than 1.11. For earlier Kubernetes versions it is recommended to use version 0.3 of the VerticalPodAutoscaler.
VPA makes scaling decisions based on usage and utilization metrics from both Prometheus and metrics-server.
Recommender is the main component of the VerticalPodAutoscaler and is responsible for computing recommended resources and generating a recommendation model. For running pods, the recommender component receives real-time usage and utilization metrics from the metrics-server via the metrics API and makes scaling decisions based on them. A best practice therefore is to ensure that the metrics-server is running in your Kubernetes cluster.
Unlike HPA, however, VPA also requires Prometheus. The history storage component of VPA consumes utilization signals and OOM events and stores them persistently, backed by Prometheus. On startup, the recommender fetches this data from the history storage and keeps it in memory.
For the recommender to pull in this historical data, a best practice is to install Prometheus in your cluster and configure it to receive metrics from cAdvisor. Also ensure that the metrics from cAdvisor carry the label that the recommender expects.
Another best practice is to set the --storage=prometheus and the --prometheus-address=<your-prometheus-address> flags in the VerticalPodAutoscaler deployment.
This is what the spec looks like:
spec:
  containers:
  - args:
    - --v=4
    - --storage=prometheus
    - --prometheus-address=http://prometheus.default.svc.cluster.local:9090
Also make sure you update the --prometheus-address flag with the name of the actual namespace that Prometheus is running in.
HPA and VPA are currently incompatible when both act on CPU or memory, and a best practice is to avoid using both together for the same set of pods. VPA can, however, be used with an HPA that is configured to use either external or custom metrics.
A best practice when configuring VPA is to use it in combination with the cluster autoscaler. The recommender component of VPA might at times recommend resource request values that exceed the available resources. This leads to resource pressure and might result in some pods going into pending state. Having the cluster autoscaler running mitigates this behaviour since it spins up new nodes in response to pending pods.
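In addition to running the cluster autoscaler, the recommendations themselves can be capped so they never exceed what a single node can offer. The manifest below is a sketch assuming the autoscaling.k8s.io/v1 VPA API is installed and a Deployment named web exists; the maxAllowed values are placeholders that should be set below the allocatable capacity of your nodes.

```yaml
# Hypothetical VPA; target name and caps are placeholders.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  updatePolicy:
    updateMode: "Auto"        # use "Off" to only publish recommendations
  resourcePolicy:
    containerPolicies:
    - containerName: "*"      # applies to every container in the pod
      maxAllowed:
        cpu: "2"              # keep below what one node can allocate
        memory: 4Gi
```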