[Kubernetes]

Everything you Need to Know about Kubernetes Quality of Service (QoS) Classes

In this blog post we will take a deep dive into Kubernetes QoS classes. We will start by looking at the factors that determine whether a pod is assigned a Guaranteed, Burstable, or BestEffort QoS class. We will then look at how QoS class impacts the way pods are scheduled by the Kubernetes Scheduler, how it impacts the order in which pods are evicted by the Kubelet, and what happens to them during node OOM events.

Hasham Haider

March 22, 2019

6 minute read

Quality of Service (QoS) class is a Kubernetes concept which determines the scheduling and eviction priority of pods. QoS class is used by the Kubernetes scheduler to make decisions about scheduling pods onto nodes.

The Kubelet uses it both to govern the order in which pods are evicted and to make more complex pod placement decisions using advanced CPU management policies.

QoS class is assigned to pods by Kubernetes itself. DevOps can, however, control the QoS class assigned to a pod by tuning the resource requests and limits of the individual containers inside the pod.

There are three QoS classes in Kubernetes:

  • Guaranteed
  • Burstable
  • BestEffort

Let’s go through the different QoS classes and see how they work together with the Kubernetes Scheduler and Kubelet.

Guaranteed

How Can I Assign a QoS class of Guaranteed to a Pod?

For a pod to be placed in the Guaranteed QoS class, every container in the pod must have a CPU and memory limit. Kubernetes will automatically assign CPU and memory request values (that are equal to the CPU and memory limit values) to the containers inside this pod and will assign it the Guaranteed QoS class.

Pods with explicit and equal values for both CPU requests and limits and memory requests and limits are also placed in the Guaranteed QoS class.
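As a sketch, a pod spec along these lines (the pod name, container name and image are illustrative) would be placed in the Guaranteed class:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-demo   # illustrative name
spec:
  containers:
  - name: app
    image: nginx
    resources:
      # requests equal limits for both CPU and memory -> Guaranteed
      limits:
        cpu: "500m"
        memory: "256Mi"
      requests:
        cpu: "500m"
        memory: "256Mi"
```

Omitting the `requests` block here would produce the same result, since Kubernetes defaults requests to the limit values.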

How does the Kubernetes Scheduler Handle Guaranteed Pods?

The Kubernetes scheduler assigns Guaranteed pods only to nodes which have enough resources to fulfil their CPU and memory requests. The Scheduler does this by ensuring that the sum of memory and CPU requests for all containers (those already running on the node plus the one being scheduled) does not exceed the node's allocatable capacity.
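A simplified sketch of that fit check (a hypothetical helper, not the actual scheduler code) might look like this:

```python
def pod_fits_node(pod_requests, scheduled_requests, node_allocatable):
    """Return True if the pod's CPU/memory requests fit on the node.

    pod_requests / node_allocatable: dicts like {"cpu": 0.5, "memory": 256}
    scheduled_requests: list of request dicts for pods already on the node.
    """
    for resource in ("cpu", "memory"):
        used = sum(r.get(resource, 0) for r in scheduled_requests)
        if used + pod_requests.get(resource, 0) > node_allocatable[resource]:
            return False
    return True

# A node with 2 CPUs / 4096 Mi, already running a pod requesting 1 CPU / 1024 Mi:
node = {"cpu": 2.0, "memory": 4096}
running = [{"cpu": 1.0, "memory": 1024}]
print(pod_fits_node({"cpu": 0.5, "memory": 1024}, running, node))  # True
print(pod_fits_node({"cpu": 1.5, "memory": 1024}, running, node))  # False
```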

Can Guaranteed Pods be Allocated Exclusive CPU Cores?

The default CPU management policy of Kubernetes is "none". Under this policy, Guaranteed pods run in the shared CPU pool on a node. The shared CPU pool contains all the CPU resources of the node minus those reserved by the Kubelet using --kube-reserved or --system-reserved.

Guaranteed pods can, however, be allocated exclusive use of CPU cores with a static CPU management policy. To be granted exclusive use of CPU cores under this policy, Guaranteed pods also need to have CPU request values in integers. Guaranteed pods with fractional CPU request values will still run in the shared CPU pool under the static CPU management policy.
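Under the static policy, a Guaranteed pod along these lines (names and values are illustrative) would be eligible for two exclusive cores:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: exclusive-cpu-demo   # illustrative name
spec:
  containers:
  - name: app
    image: nginx
    resources:
      # integer CPU value with requests == limits -> exclusive cores
      # under the static CPU management policy
      limits:
        cpu: "2"
        memory: "512Mi"
      requests:
        cpu: "2"
        memory: "512Mi"
```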

What Stops a Guaranteed Pod from being Scheduled onto a Node?

Guaranteed pods cannot be scheduled onto nodes for which the Kubelet reports a DiskPressure node condition. DiskPressure is a node condition which is triggered when the available disk space and inodes on either the node’s root filesystem or image filesystem hit an eviction threshold. When the node reports a DiskPressure condition, the Scheduler stops scheduling any new Guaranteed pods onto the node.

Burstable

How Can I assign a QoS class of Burstable to a Pod?

A pod is assigned a Burstable QoS class if at least one container in that pod has a memory or CPU request or limit, and the pod does not meet the criteria for Guaranteed.
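For example, a pod whose single container sets only a memory request (names and values are illustrative) is classed as Burstable:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: burstable-demo   # illustrative name
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        memory: "128Mi"   # a request without a matching limit -> Burstable
```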

How does the Kubernetes Scheduler Handle Burstable Pods?

The Kubernetes scheduler places Burstable pods using whatever requests their containers do declare. It cannot, however, ensure that the node will have enough resources once those pods burst above their requests, up to their limits.

Can Burstable Pods be Allocated Exclusive CPU cores?

Burstable pods run in the shared resources pool of nodes along with BestEffort and Guaranteed pods under the default “None” CPU management policy. It is not possible to allocate exclusive CPU cores to Burstable pods.

What Stops a Burstable Pod from Being Scheduled onto a Node?

As with Guaranteed pods, Burstable pods cannot be scheduled onto nodes under DiskPressure. The Kubernetes scheduler will not schedule any new Burstable pods onto a node with the DiskPressure condition.

BestEffort

How can I Assign a QoS Class of BestEffort to a Pod?

A pod is assigned a BestEffort QoS class if none of its containers have CPU or memory requests or limits.
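A minimal BestEffort pod spec (names illustrative) simply omits the resources block entirely:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: besteffort-demo   # illustrative name
spec:
  containers:
  - name: app
    image: nginx
    # no resources block at all -> BestEffort
```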

How does the Kubernetes Scheduler handle BestEffort pods?

BestEffort pods are not guaranteed to be placed onto nodes that have enough resources for them. They are, however, able to use any amount of free CPU and memory resources on the node. This can at times lead to resource contention, where BestEffort pods hog resources and do not leave enough headroom for other pods to consume resources within their requests and limits.

Can BestEffort Pods be Allocated Exclusive CPU cores?

As with pods which have a Burstable QoS class, BestEffort pods also run in the shared resources pool on a node and cannot be granted exclusive CPU resource usage.

What Stops a BestEffort Pod from Being Scheduled onto a Node?

BestEffort pods cannot be scheduled onto nodes under both DiskPressure and MemoryPressure. A node reports the MemoryPressure condition if its available memory drops below a predefined threshold. The Kubernetes Scheduler will, in turn, stop scheduling any new BestEffort pods onto that node.

Next we will look at how the Kubelet handles evictions for pods of all three QoS classes. We will also see how a pod's QoS class impacts what happens to it when the node runs out of memory.

How are Guaranteed, Burstable and BestEffort Pods Evicted by the Kubelet?

Pod evictions are initiated by the Kubelet when the node starts running low on compute resources. These evictions are meant to reclaim resources to avoid a system out of memory (OOM) event. DevOps can specify thresholds for resources which when breached trigger pod evictions by the Kubelet.
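As a sketch, such thresholds can be set in the kubelet configuration file; the values here are illustrative, not recommendations:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "500Mi"   # start evicting when free memory drops below 500Mi
  nodefs.available: "10%"     # start evicting when root filesystem free space drops below 10%
```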

The QoS class of a pod affects the order in which it is chosen for eviction by the Kubelet. The Kubelet first evicts BestEffort and Burstable pods that are using resources above their requests. The order of eviction depends on the priority assigned to each pod and the amount by which its usage exceeds its request.

Guaranteed and Burstable pods not exceeding resource requests are evicted next based on which ones have the lowest priority.

Both Guaranteed and Burstable pods whose resource usage is lower than the requested amount are never evicted because of the resource usage of another pod. They might, however, be evicted if system daemons start using more resources than reserved. In this case, Guaranteed and Burstable pods with the lowest priority are evicted first.
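The ordering described above can be sketched as a sort key; this is a simplification of the kubelet's actual ranking for memory-pressure eviction, with hypothetical pod records:

```python
def eviction_sort_key(pod):
    """Pods that sort first are evicted first.

    pod: dict with memory 'usage', 'requests' and 'priority'.
    Ranking: usage above requests first, then lower priority,
    then the larger amount of usage above requests.
    """
    over = pod["usage"] - pod["requests"]
    exceeds = over > 0
    # Python compares tuples element by element; negate 'over' so larger sorts first.
    return (not exceeds, pod["priority"], -over)

pods = [
    {"name": "a", "usage": 300, "requests": 100, "priority": 0},  # 200 over request
    {"name": "b", "usage": 150, "requests": 100, "priority": 0},  # 50 over request
    {"name": "c", "usage": 50,  "requests": 100, "priority": 0},  # under request
]
order = [p["name"] for p in sorted(pods, key=eviction_sort_key)]
print(order)  # ['a', 'b', 'c']
```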

When responding to DiskPressure node condition, the Kubelet first evicts BestEffort pods followed by Burstable pods. Only when there are no BestEffort or Burstable pods left are Guaranteed pods evicted.

What is the Node Out of Memory (OOM) Behaviour for Guaranteed, Burstable and BestEffort pods?

If the node runs out of memory before the Kubelet can reclaim it, the oom_killer kicks in to kill containers based on their oom_score. The oom_score is calculated by the oom_killer for each container and is based on the percentage of memory the container uses on the node as compared to what it requested plus the oom_score_adj score.

The oom_score_adj for each container is governed by the QoS class of the pod it belongs to. For a container inside a Guaranteed pod the oom_score_adj is “-998”, for a BestEffort pod container it is “1000” and for a Burstable pod container it is “min(max(2, 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999)”.
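As a sketch, the formula-based oom_score_adj can be computed like this (a hypothetical helper, using integer arithmetic):

```python
def oom_score_adj_from_request(memory_request_bytes, machine_memory_bytes):
    """Compute min(max(2, 1000 - (1000 * memoryRequestBytes) /
    machineMemoryCapacityBytes), 999).

    The smaller the memory request relative to node capacity, the higher
    the score, and the more likely the container is to be OOM-killed.
    """
    score = 1000 - (1000 * memory_request_bytes) // machine_memory_bytes
    return min(max(2, score), 999)

GiB = 1024 ** 3
# A container requesting 4Gi on a 16Gi node:
print(oom_score_adj_from_request(4 * GiB, 16 * GiB))   # 750
# Requesting the node's entire memory clamps to the floor of 2:
print(oom_score_adj_from_request(16 * GiB, 16 * GiB))  # 2
```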

The oom_killer first terminates containers that belong to pods with the lowest QoS class and which most exceed their requested resources. This means that containers belonging to a pod with a better QoS class (like Guaranteed) have a lower probability of being killed than ones with a Burstable or BestEffort QoS class.

This, however, is not true in all cases. Since the oom_killer also considers memory usage relative to requests, a container with a better QoS class might have a higher oom_score because of excessive memory usage and thus be killed first.

Conclusion

QoS class influences both the scheduling decisions made by the Kubernetes scheduler and the order in which pods are evicted by the Kubelet. DevOps can influence the QoS class of a pod by assigning resource limits and/or requests to the individual containers belonging to the pod.

The QoS class of a pod can in some cases impact the resource utilization of individual nodes. Since resource requests and limits are mostly set based on guesstimates, there are cases where Guaranteed and Burstable pods have a resource footprint that is much higher than required. This can lead to situations where pods do not utilize the requested resources efficiently.

Replex’s Kubernetes solution analyzes historical resource requests, usage and utilization to provide actionable optimization recommendations to Kubernetes administrators and IT managers. IT managers can then right size infrastructure as well as optimize the resource footprint of individual pods, resulting in significant cost savings.


Author

Hasham Haider

Fan of all things cloud, containers and micro-services!
