
Kubernetes Autoscaling in Production: Best Practices for Cluster Autoscaler, HPA and VPA

In this article we will take a deep dive into Kubernetes autoscaling tools including the cluster autoscaler, the horizontal pod autoscaler and the vertical pod autoscaler. We will also identify best practices that developers, DevOps and Kubernetes administrators should follow when configuring these tools.

Hasham Haider

December 5, 2019

13 minute read

Kubernetes is inherently scalable. It provides a number of tools that allow both applications and the infrastructure they are hosted on to scale in and out based on demand, efficiency and a number of other metrics.

In this article we will take a deep dive into these autoscaling tools and identify best practices for working with them. This list of best practices is targeted towards developers, DevOps and Kubernetes administrators tasked with application development and delivery on Kubernetes as well as managing and operating these applications once they are in production.


Kubernetes Autoscaling

Kubernetes has three scalability tools. Two of these, the Horizontal Pod Autoscaler (HPA) and the Vertical Pod Autoscaler (VPA), operate at the application layer. The cluster autoscaler works at the infrastructure layer.

In this article we will outline best practices for all three autoscaling tools. Let’s start with the cluster autoscaler.

What is the Cluster Autoscaler?

The cluster autoscaler is a Kubernetes tool that increases or decreases the size of a Kubernetes cluster (by adding or removing nodes), based on the presence of pending pods and node utilization metrics.

The cluster autoscaler:

  • Adds nodes to a cluster whenever it detects pending pods that could not be scheduled due to resource shortages.
  • Removes nodes from a cluster whenever the utilization of a node falls below a certain threshold defined by the cluster administrator.

The cluster autoscaler is a great tool to ensure that the underlying cluster infrastructure is elastic and scalable and can meet the changing demands of the workloads on top.

Let's now move on to the cluster autoscaler best practices.

Cluster Autoscaler Best Practices

Use the Correct Version of Cluster Autoscaler

Kubernetes is a fast-moving platform, with new versions and features released regularly. A best practice when deploying the cluster autoscaler is to ensure that you use it with the recommended Kubernetes version. Here is a complete list of which cluster autoscaler versions are compatible with which Kubernetes versions.

Ensure Cluster Nodes have the Same Capacity

The cluster autoscaler only functions correctly with Kubernetes node groups/instance groups whose nodes have the same capacity. One reason for this is that the cluster autoscaler assumes each individual node in a node group has the same CPU and memory capacity. Based on this assumption, it creates a template node for each node group and makes autoscaling decisions against that template node.

A best practice, therefore, is to ensure that the instance group being autoscaled via the cluster autoscaler has instances/nodes of the same type. For public cloud providers like AWS this might not be optimal, since diversification and availability considerations dictate the use of multiple instance types. The cluster autoscaler does support node groups with mixed instance types. A best practice, however, is to ensure that these instance types have the same resource footprint, i.e. the same amount of CPU and memory resources.

Ensure Every Pod has Resource Requests Defined

Since the cluster autoscaler makes scaling decisions based on the scheduling status of pods and the utilization of individual nodes, specifying resource requests is essential for it to function correctly.

Take cluster scale-down. The cluster autoscaler will scale down any node whose utilization is less than a specified threshold. Utilization is calculated as the sum of requested resources divided by the node's capacity. The presence of pods or containers without resource requests can throw off these utilization calculations and lead to suboptimal scaling decisions.

A best practice, therefore, is to ensure that all pods scheduled to run in an autoscaled node group/instance group have resource requests specified.
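As an illustration, here is a minimal sketch of a pod spec with resource requests (and limits) defined; the pod name, image and values are placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  containers:
  - name: web
    image: nginx:1.17
    resources:
      requests:
        cpu: 250m        # CPU the scheduler reserves for this container
        memory: 256Mi    # memory the scheduler reserves for this container
      limits:
        cpu: 500m
        memory: 512Mi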

Specify PodDisruptionBudget for kube-system Pods

Kube-system pods by default prevent the cluster autoscaler from scaling down the nodes they are running on. If these pods end up spread across multiple nodes, they can prevent the cluster from scaling down at all.

To avoid situations where nodes cannot be scaled down due to the presence of system pods, a best practice is to specify a pod disruption budget for these pods. Pod disruption budgets allow Kubernetes administrators to avoid disruptions to critical pods and ensure that a desired number of these pods is always running.

While specifying a disruption budget for system pods it is important to consider the number of replicas of these pods that are provisioned by default.

Kube-dns is the only system pod that runs multiple replicas by default. Most other system pods run as single-instance pods, and restarting them could disrupt the cluster.

A best practice in this context is to avoid specifying a disruption budget for single-instance pods like the metrics-server.
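As a sketch, a disruption budget for a multi-replica system component such as kube-dns might look like the following; the selector label is an assumption and should be verified against your cluster:

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: kube-dns-pdb
  namespace: kube-system
spec:
  minAvailable: 1          # keep at least one DNS replica running during voluntary evictions
  selector:
    matchLabels:
      k8s-app: kube-dns    # label used by kube-dns/CoreDNS in many clusters; verify in yours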

Specify PodDisruptionBudget for Application Pods

In addition to specifying a pod disruption budget for system pods, another best practice is to specify one for application pods as well. This ensures that the cluster autoscaler does not scale down pod replicas beyond a certain minimum number, protecting critical applications from disruption and ensuring high availability.

Pod disruption budgets can be specified using the .spec.minAvailable and .spec.maxUnavailable fields. .spec.minAvailable specifies the number of pods that must still be available after an eviction, as an absolute number or a percentage. Similarly, .spec.maxUnavailable sets the maximum number of pods that can be unavailable after an eviction, again as an absolute number or a percentage.
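For example, a hypothetical disruption budget for an application deployment using .spec.maxUnavailable as a percentage:

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: frontend-pdb
spec:
  maxUnavailable: 25%      # at most a quarter of the matching pods may be evicted at a time
  selector:
    matchLabels:
      app: frontend        # hypothetical label; match it to your application's pod labels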

Avoid using the Cluster Autoscaler with more than 1000 Node Clusters

For the cluster autoscaler to remain responsive, it is important to ensure that the cluster does not exceed a certain size. The official scalability and responsiveness service level objective for the cluster autoscaler is 1000 nodes, with each node running 30 pods. Here is a complete write-up of the scale-up and scale-down results from a test setup with a 1000-node cluster.

A best practice therefore is to avoid cluster sprawl and ensure that the cluster footprint does not exceed the specified scalability limit.

Ensure Resource Availability for the Cluster Autoscaler Pod

For larger clusters it is important to ensure resource availability for the cluster autoscaler. A best practice in this context is to set resource requests of the cluster autoscaler pod to a minimum of 1 CPU.

It is also important to ensure that the node the cluster autoscaler pod is running on has enough resources available to support it. Running the cluster autoscaler pod on a node under resource pressure could lead to degraded performance or the cluster autoscaler becoming unresponsive.
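As a sketch, the resources section of the cluster autoscaler container could look like this; the memory values are illustrative and should be sized to your cluster:

resources:
  requests:
    cpu: "1"           # at least 1 CPU, as recommended above
    memory: 1Gi        # illustrative value; size according to the number of nodes
  limits:
    cpu: "1"
    memory: 1Gi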

Ensure Resource Requests are Close to Actual Usage

As mentioned before, the cluster autoscaler makes scaling decisions based on the presence of pending pods and the utilization of individual nodes. Node utilization is calculated as the sum of the requested resources of all pods on the node divided by its capacity.

However, most developers tend to over-provision resource requests. This can lead to situations where pods do not use their requested resources efficiently and actual node utilization is low. But since the total resource requests are high, the cluster autoscaler calculates a high utilization level for the node and will not scale it down.

A best practice, therefore, is to ensure that pods' requested resources are close to their actual resource usage. Using the Vertical Pod Autoscaler (VPA) is a good starting point. These decisions can also be based on the historical resource usage and consumption of pods.

Over-provision the Cluster to Ensure Headroom for Critical Pods

The cluster autoscaler has a service level objective (SLO) of 30 seconds of latency between the time a pod is marked as unschedulable and the time it requests a scale-up from the cloud provider. This latency benchmark applies to smaller clusters of fewer than 100 nodes. For larger clusters of up to 1000 nodes the latency is expected to be around the 60-second mark.

The actual time it takes for the pod to be scheduled, once the scale-up has been requested and a new node is being provisioned, depends on the cloud provider. This could very well mean a delay of several minutes.

To avoid this delay and ensure that pods spend as little time as possible in the unschedulable state, a best practice is to over-provision the cluster. This can be accomplished using a deployment running pause pods.

Pause pods are dummy pods spun up exclusively to reserve space for higher-priority pods. Since pause pods are assigned a very low priority, the Kubernetes scheduler evicts them to make space for unscheduled pods with a higher priority. This means that critical pods do not have to wait for a new node to be provisioned by the cloud provider and can be quickly scheduled on existing nodes, replacing the pause pods.

When the evicted pause pods are re-created, they in turn become unschedulable, which triggers a cluster scale-up. The amount of over-provisioned headroom can be controlled by adjusting the size and number of the pause pods.
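The sketch below shows one common way to set this up, assuming a negative-priority PriorityClass and a deployment of pause containers; the replica count and resource requests are placeholders that determine how much headroom is reserved:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -1                  # lower than the default priority of 0, so these pods are evicted first
globalDefault: false
description: "Priority class for pause pods that reserve headroom"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
spec:
  replicas: 2                        # number of pause pods controls total headroom
  selector:
    matchLabels:
      run: overprovisioning
  template:
    metadata:
      labels:
        run: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
      - name: reserve-resources
        image: k8s.gcr.io/pause
        resources:
          requests:
            cpu: 500m              # size of each pause pod controls headroom per pod
            memory: 512Mi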

To recap here are the recommended best practices for the cluster autoscaler on Kubernetes:

  • Use the correct version of Cluster autoscaler
  • Ensure cluster nodes have the same capacity
  • Ensure every pod has resource requests defined
  • Specify PodDisruptionBudget for kube-system pods
  • Specify PodDisruptionBudget for application pods
  • Avoid using the Cluster autoscaler with more than 1000 node clusters
  • Ensure resource availability for the cluster autoscaler pod
  • Ensure resource requests are close to actual usage
  • Over-provision cluster to ensure head room for critical pods

Let us now move on to the horizontal pod autoscaler (HPA).

HorizontalPodAutoscaler (HPA) Best Practices

HPA scales the number of pods in a replication controller, deployment, replica set or stateful set based on CPU utilization. HPA can also be configured to make scaling decisions based on custom or external metrics. 

HPA is a great tool to ensure that critical applications are elastic and can scale out to meet increasing demand as well as scale in to ensure optimal resource usage.
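As a minimal sketch, an HPA targeting average CPU utilization using the autoscaling/v2beta2 API could look like this; the deployment name and values are placeholders:

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: frontend-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: frontend               # hypothetical deployment to be scaled
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # scale out when average CPU utilization exceeds 70% of requests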

Ensure all Pods have Resource Requests Configured

HPA makes scaling decisions based on the observed CPU utilization values of pods that are part of a Kubernetes controller. Utilization values are calculated as a percentage of the resource requests of the individual pods. Missing resource request values for some containers can throw off the utilization calculations of the HPA controller, leading to suboptimal operation and scaling decisions.

A best practice, therefore, is to ensure that resource request values are configured for all containers of every pod that is part of the Kubernetes controller being scaled with HPA.

Install metrics-server

HPA makes scaling decisions based on per-pod resource metrics retrieved from the resource metrics API (metrics.k8s.io). The metrics.k8s.io API is provided by the metrics-server. A best practice therefore is to launch metrics-server in your Kubernetes cluster as a cluster add-on. 

In addition, another best practice is to set --horizontal-pod-autoscaler-use-rest-clients to true or leave it unset. This is important because setting this flag to false reverts to Heapster, which is deprecated as of Kubernetes 1.11.

Configure Custom or External Metrics 

The HPA can also make scaling decisions based on custom or external metrics. Two types of custom metrics are supported: pod metrics and object metrics. Pod metrics are averaged across all pods and as such only support a target type of AverageValue. Object metrics can describe any other object in the same namespace and support target types of both Value and AverageValue.

A best practice when configuring custom metrics is to ensure that the correct target type is used for pod and object metrics. 

External metrics allow HPA to autoscale applications based on metrics provided by third-party monitoring systems. External metrics support target types of both Value and AverageValue.
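To illustrate the target types, here is a sketch of the metrics section of an autoscaling/v2beta2 HPA combining pod, object and external metrics; the metric names and objects are hypothetical:

metrics:
- type: Pods                       # custom per-pod metric, averaged across pods
  pods:
    metric:
      name: packets_per_second     # hypothetical metric name
    target:
      type: AverageValue           # pod metrics only support AverageValue
      averageValue: 1k
- type: Object                     # custom metric describing another object in the namespace
  object:
    metric:
      name: requests_per_second    # hypothetical metric name
    describedObject:
      apiVersion: networking.k8s.io/v1beta1
      kind: Ingress
      name: main-route             # hypothetical Ingress object
    target:
      type: Value                  # object metrics support both Value and AverageValue
      value: 10k
- type: External                   # metric from an external monitoring system
  external:
    metric:
      name: queue_messages_ready   # hypothetical metric name
    target:
      type: AverageValue
      averageValue: "30"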

Prefer Custom Metrics over External Metrics whenever Possible

A best practice when deciding between custom and external metrics (where such a choice is possible) is to prefer custom metrics. One reason is that the external metrics API takes a lot more effort to secure than the custom metrics API and could potentially allow access to all metrics.

Configure Cooldown Period

The dynamic nature of the metrics evaluated by the HPA may at times lead to scaling events in quick succession without any interval between them. This leads to thrashing, where the number of replicas fluctuates frequently, which is not desirable.

To get around this and specify a cooldown period, a best practice is to configure the --horizontal-pod-autoscaler-downscale-stabilization flag passed to the kube-controller-manager. This flag has a default value of 5 minutes and specifies the duration the HPA waits after a downscale event before initiating another downscale operation.

Kubernetes admins should also take into account the unique requirements of their applications when deciding on an optimal value for this duration.

By default, the HPA tolerates a 10% change in the desired-to-actual metrics ratio before scaling. Depending on application requirements, this value can be changed by configuring the --horizontal-pod-autoscaler-tolerance flag. Other configurable flags include --horizontal-pod-autoscaler-cpu-initialization-period, --horizontal-pod-autoscaler-initial-readiness-delay and --horizontal-pod-autoscaler-sync-period. All of these can be tuned based on unique cluster or application requirements.
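These flags are passed to the kube-controller-manager. On clusters where it runs as a static pod, the relevant fragment of its manifest might look like the following sketch; the values shown are the defaults and are only illustrative:

spec:
  containers:
  - command:
    - kube-controller-manager
    - --horizontal-pod-autoscaler-downscale-stabilization=5m0s   # cooldown after a downscale event
    - --horizontal-pod-autoscaler-tolerance=0.1                  # 10% tolerance on the desired/actual metrics ratio
    - --horizontal-pod-autoscaler-sync-period=15s                # how often the HPA evaluates metrics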

To recap, here are the Horizontal Pod Autoscaler (HPA) best practices:

  • Ensure all pods have resource requests specified
  • Install metrics-server
  • Configure custom or external metrics
  • Prefer Custom metrics over external metrics
  • Configure cool-down period

Vertical Pod Autoscaler (VPA) Best Practices

Next we will review best practices for the Vertical Pod Autoscaler (VPA). VPA automatically sets the resource request and limit values of containers based on usage. VPA aims to reduce the maintenance overhead of configuring resource requests and limits for containers and improve the utilization of cluster resources.

The VerticalPodAutoscaler can: 

  • Reduce the request value for containers whose resource usage is consistently lower than the requested amount.
  • Increase request values for containers that consistently use a high percentage of resources requested. 
  • Automatically set resource limit values based on limit to request ratios specified as part of the container template. 
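As a sketch, a basic VerticalPodAutoscaler object might look like this; the target deployment name is hypothetical and the update mode depends on whether you want recommendations applied automatically:

apiVersion: autoscaling.k8s.io/v1beta2
kind: VerticalPodAutoscaler
metadata:
  name: frontend-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: frontend             # hypothetical deployment whose pods VPA will resize
  updatePolicy:
    updateMode: "Auto"         # use "Off" to only generate recommendations without applying them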

Use the Correct Kubernetes Version

Version 0.4 and later of the VerticalPodAutoscaler require custom resource definition capabilities and can therefore not be used with Kubernetes versions older than 1.11. For earlier Kubernetes versions, it is recommended to use version 0.3 of the VerticalPodAutoscaler.

Install metrics-server and Prometheus

VPA makes scaling decisions based on usage and utilization metrics from both Prometheus and metrics-server. 

The recommender is the main component of the VerticalPodAutoscaler and is responsible for computing recommended resources and generating a recommendation model. For running pods, the recommender receives real-time usage and utilization metrics from the metrics-server via the metrics API and makes scaling decisions based on them. A best practice, therefore, is to ensure that the metrics-server is running in your Kubernetes cluster.

Unlike the HPA, however, the VPA also requires Prometheus. The history storage component of VPA consumes utilization signals and OOM events and stores them persistently, backed by Prometheus. On startup, the recommender fetches this data from history storage and keeps it in memory.

For the recommender to pull in this historical data, a best practice is to install Prometheus in your cluster and configure it to scrape metrics from cAdvisor. Also ensure that metrics from cAdvisor have the label job=kubernetes-cadvisor.
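A sketch of the corresponding Prometheus scrape configuration, assuming cAdvisor metrics are scraped through the kubelet; the exact setup varies between clusters, but the job name must result in the job=kubernetes-cadvisor label:

scrape_configs:
- job_name: kubernetes-cadvisor          # VPA's history storage expects the job=kubernetes-cadvisor label
  scheme: https
  metrics_path: /metrics/cadvisor        # cAdvisor metrics exposed via the kubelet
  kubernetes_sd_configs:
  - role: node
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token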

Another best practice is to set the --storage=prometheus and --prometheus-address=<your-prometheus-address> flags in the VerticalPodAutoscaler deployment.

This is what the spec looks like:

spec:
  containers:
  - args:
    - --v=4
    - --storage=prometheus
    - --prometheus-address=http://prometheus.default.svc.cluster.local:9090

Also make sure the --prometheus-address flag reflects the actual namespace that Prometheus is running in.

Avoid using HPA and VPA in tandem

HPA and VPA are currently incompatible and a best practice is to avoid using both together for the same set of pods. VPA can however be used with HPA that is configured to use either external or custom metrics. 

Use VPA together with Cluster autoscaler

A best practice when configuring VPA is to use it in combination with the cluster autoscaler. The recommender component of VPA might at times recommend resource request values that exceed the available resources. This leads to resource pressure and might result in some pods going into pending state. Having the cluster autoscaler running mitigates this behaviour since it spins up new nodes in response to pending pods. 

To recap here are the recommended best practices for VPA:

  • Use the Correct Kubernetes Version
  • Install metrics-server and Prometheus
  • Avoid using HPA and VPA in tandem
  • Use VPA together with Cluster autoscaler

Want to learn more? Download the Complete Best Practices Checklist with Checks, Recipes and Best Practices for Resource Management, Security, Scalability and Monitoring for Production-Ready Kubernetes

Author: Hasham Haider

Fan of all things cloud, containers and micro-services!
