Quality of Service (QoS) class is a Kubernetes concept which determines the scheduling and eviction priority of pods. QoS class is used by the Kubernetes scheduler to make decisions about scheduling pods onto nodes.
The Kubelet uses it to govern the order in which pods are evicted and to allow more complex pod placement decisions using advanced CPU management policies.
QoS class is assigned to pods by Kubernetes itself. DevOps can, however, control the QoS class assigned to a pod by setting resource requests and limits for the individual containers inside the pod.
There are three QoS classes in Kubernetes: Guaranteed, Burstable and BestEffort.
Let’s go through the different QoS classes and see how they work together with the Kubernetes Scheduler and Kubelet.
For a pod to be placed in the Guaranteed QoS class, every container in the pod must have a CPU and memory limit. Kubernetes will automatically assign CPU and memory request values (that are equal to the CPU and memory limit values) to the containers inside this pod and will assign it the Guaranteed QoS class.
Pods with explicit and equal values for both CPU requests and limits and memory requests and limits are also placed in the Guaranteed QoS class.
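The request defaulting described above can be illustrated with a minimal sketch. It assumes a container is represented as a plain dictionary with optional `requests` and `limits` entries (CPU in millicores, memory in MiB); the helper name `with_default_requests` is hypothetical and is not part of any Kubernetes API.

```python
# Illustrative only: containers are plain dicts, not real Kubernetes objects.
RESOURCES = ("cpu", "memory")


def with_default_requests(container):
    """Return a copy of the spec with missing requests copied from limits."""
    limits = dict(container.get("limits", {}))
    requests = dict(container.get("requests", {}))
    for name in RESOURCES:
        # A limit without a matching request yields a request equal to the limit.
        if name in limits and name not in requests:
            requests[name] = limits[name]
    return {"requests": requests, "limits": limits}


# A container that only declares limits (assumed units: millicores, MiB).
limits_only = {"limits": {"cpu": 500, "memory": 256}}
defaulted = with_default_requests(limits_only)
```

After defaulting, requests and limits are equal for both resources, which is why a pod whose containers only declare limits still ends up in the Guaranteed class.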
The Kubernetes scheduler assigns Guaranteed pods only to nodes which have enough resources to fulfil their CPU and memory requests. The scheduler does this by checking that the sum of the CPU and memory requests of all containers on the node, both running and newly scheduled, does not exceed the node's allocatable capacity.
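As a rough sketch of this fit check, the function below sums the requests already placed on a node and adds those of the incoming pod. The function name `fits_on_node`, the dictionary layout and the units are assumptions made for illustration; the sketch also assumes that a sum exactly equal to the node's capacity still fits.

```python
def fits_on_node(node_capacity, running_requests, new_pod_requests):
    """Check whether a new pod's requests fit next to the pods already placed.

    node_capacity:     e.g. {"cpu": 4000, "memory": 8192} (millicores, MiB)
    running_requests:  list of per-pod request dicts already on the node
    new_pod_requests:  request dict of the pod being scheduled
    """
    for resource, capacity in node_capacity.items():
        already_requested = sum(r.get(resource, 0) for r in running_requests)
        wanted = new_pod_requests.get(resource, 0)
        if already_requested + wanted > capacity:
            return False
    return True


node = {"cpu": 4000, "memory": 8192}
running = [{"cpu": 1000, "memory": 2048}, {"cpu": 2000, "memory": 4096}]
small_pod_fits = fits_on_node(node, running, {"cpu": 1000, "memory": 1024})
large_pod_fits = fits_on_node(node, running, {"cpu": 1500, "memory": 1024})
```

Note that the check works on requests, not on actual usage: a node can be fully "booked" while its real utilization is low.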
The default CPU management policy of Kubernetes is “None”. Under this policy Guaranteed pods run in the shared CPU pool on a node. The shared CPU pool contains all the CPU resources on the node minus the ones reserved by the Kubelet using --kube-reserved or --system-reserved.
Guaranteed pods can, however, be allocated exclusive use of CPU cores with a static CPU management policy. To be granted exclusive use of CPU cores under this policy, Guaranteed pods also need to have CPU request values in integers. Guaranteed pods with fractional CPU request values will still run in the shared CPU pool under the static CPU management policy.
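The eligibility rule for exclusive cores can be written down as a small predicate. This is a simplified sketch, not the Kubelet's implementation: the function name and the millicore representation of the CPU request are assumptions for illustration.

```python
def gets_exclusive_cores(qos_class, cpu_request_millis, cpu_policy="static"):
    """Return True if the workload qualifies for exclusive CPU cores.

    cpu_request_millis is the CPU request in millicores, so 2000 means
    two whole cores and 1500 means one and a half.
    """
    if cpu_policy != "static":
        return False  # under the "none" policy everything shares one pool
    if qos_class != "Guaranteed":
        return False  # Burstable and BestEffort always use the shared pool
    # Only whole-core requests qualify; fractional requests stay shared.
    return cpu_request_millis > 0 and cpu_request_millis % 1000 == 0


whole_cores = gets_exclusive_cores("Guaranteed", 2000)
fractional = gets_exclusive_cores("Guaranteed", 1500)
```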
Guaranteed pods cannot be scheduled onto nodes for which the Kubelet reports a DiskPressure node condition. DiskPressure is a node condition which is triggered when the available disk space and inodes on either the node’s root filesystem or image filesystem hit an eviction threshold. When the node reports a DiskPressure condition, the Scheduler stops scheduling any new Guaranteed pods onto the node.
A pod is assigned the Burstable QoS class if it does not meet the criteria for the Guaranteed class and at least one container in that pod has a memory or CPU request or limit.
The Kubernetes scheduler places Burstable pods based on their requests only, so it cannot ensure that the node has enough resources for them to burst up to their limits.
Burstable pods run in the shared resources pool of nodes along with BestEffort and Guaranteed pods under the default “None” CPU management policy. It is not possible to allocate exclusive CPU cores to Burstable pods.
As with Guaranteed pods, Burstable pods also cannot be scheduled onto nodes under DiskPressure. The Kubernetes scheduler will not schedule any new Burstable pods onto a node with the DiskPressure condition.
A pod is assigned the BestEffort QoS class if none of its containers have CPU or memory requests or limits.
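With all three classes defined, the assignment rules can be summarized in a short classifier. This is a minimal sketch under stated assumptions: containers are plain dictionaries with optional `requests` and `limits` entries, values are compared as plain numbers, and the function name `qos_class` is hypothetical.

```python
def qos_class(containers):
    """Derive the QoS class of a pod from its containers' requests and limits."""
    any_value_set = False
    all_guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        for name in ("cpu", "memory"):
            limit = limits.get(name)
            # A missing request defaults to the limit, if there is one.
            request = requests.get(name, limit)
            if request is not None or limit is not None:
                any_value_set = True
            # Guaranteed needs a limit and an equal request for every resource.
            if limit is None or request != limit:
                all_guaranteed = False
    if not any_value_set:
        return "BestEffort"
    if all_guaranteed:
        return "Guaranteed"
    return "Burstable"


limits_only_pod = [{"limits": {"cpu": 500, "memory": 256}}]
request_only_pod = [{"requests": {"cpu": 100}}]
empty_pod = [{}]
```

One consequence worth noting: a single container without limits is enough to move an otherwise Guaranteed pod into the Burstable class.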
BestEffort pods are not guaranteed to be placed onto nodes that have enough resources for them. They are, however, able to use any amount of free CPU and memory resources on the node. This can at times lead to resource contention, where BestEffort pods hog resources and do not leave enough headroom for other pods to consume resources within their limits.
As with pods which have a Burstable QoS class, BestEffort pods also run in the shared resources pool on a node and cannot be granted exclusive CPU resource usage.
BestEffort pods cannot be scheduled onto nodes that report either DiskPressure or MemoryPressure. A node reports the MemoryPressure condition if its available memory falls below a predefined threshold. The Kubernetes scheduler will, in turn, stop scheduling any new BestEffort pods onto the node.
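The node condition rules for the three classes, as described in this article, can be collected into one lookup table. The table and the function name `can_schedule` are illustrative assumptions and do not mirror the scheduler's actual code.

```python
# Which QoS classes stop being scheduled under which node condition,
# following the description in this article.
BLOCKED_CLASSES = {
    "DiskPressure": {"Guaranteed", "Burstable", "BestEffort"},
    "MemoryPressure": {"BestEffort"},
}


def can_schedule(qos_class, node_conditions):
    """Return True if a new pod of this class may still land on the node."""
    return all(
        qos_class not in BLOCKED_CLASSES.get(condition, set())
        for condition in node_conditions
    )


burstable_ok = can_schedule("Burstable", ["MemoryPressure"])
besteffort_ok = can_schedule("BestEffort", ["MemoryPressure"])
```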
Next we will look at how the Kubelet handles evictions for pods of all three QoS classes. We will also see how a pod's QoS class impacts what happens to it when the node runs out of memory.
Pod evictions are initiated by the Kubelet when the node starts running low on compute resources. These evictions are meant to reclaim resources to avoid a system out of memory (OOM) event. DevOps can specify thresholds for resources which when breached trigger pod evictions by the Kubelet.
The QoS class of a pod does affect the order in which it is chosen for eviction by the Kubelet. Kubelet first evicts BestEffort and Burstable pods using resources above requests. The order of eviction depends on the priority assigned to each pod and the amount of resources being consumed above request.
Guaranteed and Burstable pods not exceeding resource requests are evicted next based on which ones have the lowest priority.
Both Guaranteed and Burstable pods whose resource usage is lower than the requested amount are never evicted because of the resource usage of another pod. They might, however, be evicted if system daemons start using more resources than reserved. In this case, Guaranteed and Burstable pods with the lowest priority are evicted first.
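The eviction ordering described above can be approximated with a sort key. This is a simplified sketch for a single resource, not the Kubelet's ranking code: the pod dictionaries, the field names and the tie-breaking within the second group are assumptions for illustration. A BestEffort pod has a request of 0, so any usage counts as exceeding its request.

```python
def eviction_order(pods):
    """Sort pods from first-evicted to last-evicted for one resource."""
    def rank(pod):
        over_request = pod["usage"] - pod["request"]
        exceeds = over_request > 0
        # 1. pods above their requests come first (False sorts before True)
        # 2. then lower priority first
        # 3. then the largest overshoot first
        return (not exceeds, pod["priority"], -over_request)

    return sorted(pods, key=rank)


pods = [
    {"name": "guaranteed", "priority": 100, "request": 512, "usage": 400},
    {"name": "burstable-over", "priority": 10, "request": 256, "usage": 300},
    {"name": "besteffort", "priority": 0, "request": 0, "usage": 100},
    {"name": "burstable-under", "priority": 5, "request": 256, "usage": 100},
]
order = [pod["name"] for pod in eviction_order(pods)]
```

In this example the two pods above their requests are ranked ahead of the two that stay within them, regardless of their absolute memory usage.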
When responding to DiskPressure node condition, the Kubelet first evicts BestEffort pods followed by Burstable pods. Only when there are no BestEffort or Burstable pods left are Guaranteed pods evicted.
If the node runs out of memory before the Kubelet can reclaim it, the oom_killer kicks in to kill containers based on their oom_score. The oom_score is calculated by the oom_killer for each container and is based on the percentage of memory the container uses on the node as compared to what it requested plus the oom_score_adj score.
The oom_score_adj for each container is governed by the QoS class of the pod it belongs to. For a container inside a Guaranteed pod the oom_score_adj is “-998”, for a BestEffort pod container it is “1000” and for a Burstable pod container “min(max(2, 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999)”.
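As a worked example, the sketch below computes the adjustment per class, with the fixed values applying to Guaranteed and BestEffort containers and the request-based formula applying to Burstable containers. The function name and the use of integer division are assumptions for illustration; the constant -998 is the value quoted in this article.

```python
GIB = 1024 ** 3


def oom_score_adj(qos_class, memory_request_bytes=0, machine_memory_capacity_bytes=1):
    """Return the oom_score_adj for a container of the given QoS class."""
    if qos_class == "Guaranteed":
        return -998
    if qos_class == "BestEffort":
        return 1000
    # Burstable: the larger the memory request relative to the node,
    # the lower (safer) the adjustment, clamped to the range 2..999.
    adj = 1000 - (1000 * memory_request_bytes) // machine_memory_capacity_bytes
    return min(max(2, adj), 999)


# A Burstable container requesting 4 GiB on a 16 GiB node:
# 1000 - (1000 * 4) / 16 = 750
burstable_adj = oom_score_adj("Burstable", 4 * GIB, 16 * GIB)
```

The clamping keeps every Burstable container strictly between the Guaranteed and BestEffort values, so the adjustment alone never ranks a Burstable container below a Guaranteed one or above a BestEffort one.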
The oom_killer first terminates containers that belong to pods with the lowest QoS class and which most exceed the requested resources. This means that containers belonging to a pod with a better QoS class (like Guaranteed) have a lower probability of being killed than ones with a Burstable or BestEffort QoS class.
This, however, is not true of all cases. Since the oom_killer also considers memory usage vs request, a container with a better QoS class might have a higher oom_score because of excessive memory usage and thus might be killed first.
QoS class determines the order in which the Kubernetes scheduler schedules pods as well as the order in which they are evicted by the Kubelet. DevOps can influence the QoS class of a pod by assigning resource limits and/or requests to individual containers belonging to the pod.
The QoS class of a pod can in some cases impact the resource utilization of individual nodes. Since resource requests and limits are mostly set based on guesstimates, there are cases where Guaranteed and Burstable pods have a resource footprint that is much higher than required. This can lead to situations where pods do not utilize the requested resources efficiently.
Replex’s Kubernetes solution analyzes historical resource requests, usage and utilization to provide actionable optimization recommendations to Kubernetes administrators and IT managers. IT managers can then right size infrastructure as well as optimize the resource footprint of individual pods, resulting in significant cost savings.