
Kubernetes in Production - Best Practices, How-to Guides and Tutorials

A comprehensive guide to Kubernetes in Production. This guide contains best practices, checklists, tutorials and how-to guides for production-ready Kubernetes. We will be updating this page periodically, so stay tuned!

Hasham Haider

May 2, 2019

If you are a Game of Thrones fan you probably know of the Sunset Sea. Many brave explorers, trying to uncover its secrets, have set sail into its dark waters never to be seen or heard of again. This might be a little overboard (no pun intended), but DevOps teams trying their luck with production-ready Kubernetes tend to face the same rough and choppy seas.

In this comprehensive guide, we will be collecting resources from across our blog on how to ensure smooth sailing for Kubernetes in Production. It will include deep dives into topics ranging from deployment and installation to best practices, tutorials and guides.

Following are some of the questions we intend to answer with this guide:

What does it take to make a Kubernetes environment production-ready?


What needs to happen to ensure frictionless continued operations for Kubernetes in production?

Let’s take a closer look at both.

What does it take to make a Kubernetes environment production-ready?

Before we answer that question, we need to define what production-ready means. At a minimum, a production-ready Kubernetes environment is one that is ready to start serving traffic.

However, a lot more needs to happen before a Kubernetes environment can be said to be production-ready. It also needs to be highly available and reliable, with built-in disaster recovery, as well as secure, scalable and optimised. In enterprise settings, Kubernetes environments also need to hold up in terms of Governance, Compliance, Monitoring and Networking requirements.

One way of ensuring this is to develop a set of best practices to be followed whenever spinning up new Kubernetes environments. In this section, we will outline a checklist of best practices for Availability, Scalability, Security and Resource Management to ensure our Kubernetes environments are production-ready. 

What needs to happen to ensure frictionless continued operations for Kubernetes in production?

In this section, we will discuss the process of setting up a monitoring pipeline for Kubernetes in production. Kubernetes introduces a number of abstractions on both the application and the infrastructure layer. Additionally, the trend towards micro-service architectures and the fact that containers and pods are ephemeral introduce further monitoring challenges. A monitoring pipeline needs to take all of this into account when choosing which metrics to monitor and which tools to use.

Let’s now move on to section one where we outline a set of best practices to ensure our Kubernetes environments are production-ready. We will go through all of the aspects mentioned above including Resource Management, Availability, Scalability and Security. We will be adding best practices for Governance, Compliance, Monitoring and Networking as well.

Kubernetes in Production: High Availability Best Practices

Building high availability and disaster recovery into Kubernetes environments is essential for resilient applications. Spinning up multiple master nodes (an odd number, with three as the usual minimum) is a good start. High availability architectures, however, need to go beyond multi-master setups and build in redundancy on both the application and the infrastructure layer. Below we outline some of the best practices that should be followed for Kubernetes in production:

Master nodes should be provisioned in odd numbers, with a five-member etcd cluster recommended for production, because etcd needs a majority of its members (a quorum) to stay available. Replicating both master and worker nodes across cloud provider availability zones is another best practice. This ensures that the cluster can survive an outage in any one zone.
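To see why odd member counts are preferred, here is a small sketch of the majority arithmetic behind an etcd cluster. The function names are our own and purely illustrative; only the majority rule itself comes from how etcd works.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    return members // 2 + 1


def failure_tolerance(members: int) -> int:
    """How many members can fail while a majority is still reachable."""
    return members - quorum(members)


if __name__ == "__main__":
    for size in range(1, 8):
        print(f"{size} members: quorum={quorum(size)}, "
              f"tolerates {failure_tolerance(size)} failure(s)")
```

Running it shows that an even-sized cluster tolerates no more failures than the odd-sized cluster one member smaller, so the extra member adds cost without adding resilience.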

Another best practice is to isolate etcd replicas by placing them on dedicated nodes. This helps avoid outages caused by resource starvation of etcd members. Etcd data should also be backed up regularly, since it stores the cluster state and is essential for disaster recovery.

When replicating the scheduler and controller manager, a best practice is to run the replicas in an active-passive setup: leader election ensures that only one instance is active at a time while the others stand by.

Here is the complete high availability best practice checklist for Kubernetes in production.

Kubernetes in Production: Resource Management Best Practices

Kubernetes abstracts resources away from the underlying cloud VMs or physical machines. These resources can then be allocated to and consumed by individual containers. Kubernetes also introduces resource requests and limits, which act as soft and hard bounds on the amount of resources that can be consumed. The container is the lowest level at which these can be set; the namespace is another.

Kubernetes resource consumption can also be managed on a number of different levels and abstractions. Below we outline best practices for managing resource consumption for Kubernetes in production.

Specifying resource requests and limits for individual containers is a best practice that should figure right at the top of any deployment checklists. Another best practice is to divide Kubernetes environments into separate namespaces for individual teams, departments, clients or applications.

Once namespaces have been created, a best practice is to do the following for each one:

Specify minimum and maximum resource requests and limits for individual containers, and configure default requests and limits (both via a LimitRange). Cap the total resource requests and limits of all containers in the namespace (ResourceQuota). Cap the total number of Kubernetes objects, including Pods, Services, PersistentVolumeClaims and ReplicaSets, that can be provisioned inside the namespace (object count quotas).
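As a rough illustration of the per-namespace settings above, the sketch below builds a LimitRange and a ResourceQuota as plain Python dictionaries and prints them as JSON, which kubectl accepts as well as YAML. The namespace name, object names and every number are hypothetical placeholders, not recommendations.

```python
import json

NAMESPACE = "team-a"  # hypothetical namespace name

# Per-container bounds and defaults (all values are illustrative).
limit_range = {
    "apiVersion": "v1",
    "kind": "LimitRange",
    "metadata": {"name": "container-limits", "namespace": NAMESPACE},
    "spec": {
        "limits": [
            {
                "type": "Container",
                "min": {"cpu": "100m", "memory": "64Mi"},
                "max": {"cpu": "2", "memory": "2Gi"},
                "defaultRequest": {"cpu": "250m", "memory": "128Mi"},
                "default": {"cpu": "500m", "memory": "256Mi"},
            }
        ]
    },
}

# Namespace-wide caps on compute resources and object counts.
resource_quota = {
    "apiVersion": "v1",
    "kind": "ResourceQuota",
    "metadata": {"name": "namespace-quota", "namespace": NAMESPACE},
    "spec": {
        "hard": {
            "requests.cpu": "10",
            "requests.memory": "20Gi",
            "limits.cpu": "20",
            "limits.memory": "40Gi",
            "pods": "50",
            "services": "20",
            "persistentvolumeclaims": "10",
            "count/replicasets.apps": "40",
        }
    },
}


def as_list(*objects):
    """Wrap several objects in a v1 List so they can be applied together."""
    return {"apiVersion": "v1", "kind": "List", "items": list(objects)}


if __name__ == "__main__":
    print(json.dumps(as_list(limit_range, resource_quota), indent=2))
```

Assuming the script is saved as quotas.py (a name chosen here for illustration) and the namespace already exists, the output could be piped into kubectl apply -f - to create both objects.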

Additional best practices for resource management include enabling log rotation, configuring out-of-resource handling and using recommended settings for persistent volumes.

Here is the complete resource management best practice checklist for Kubernetes in production.

Before we move on to security best practices for Kubernetes in Production, let’s outline some key questions that IT managers and DevOps teams need to be asking about their production Kubernetes environments. These questions have been compiled in the context of resource management.

A good starting point is to ask how many resources individual containers or pods consume in production. DevOps teams can then build on this baseline by comparing consumption to the amount of resources requested. This provides resource utilisation metrics which can, in turn, be used to tweak resource requests and limits for containers and optimise usage.

However, it is also important to avoid resource exhaustion and ensure sufficient headroom for containers to deal with spikes in usage.

IT managers need to ask these questions on the infrastructure level. How much of the provisioned cloud resources (VMs) are being consumed by the pods running on top, and what is the average utilisation?
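The questions above boil down to a few ratios. The following is a minimal sketch, using CPU millicores as the unit; the function names, the sample numbers and the 20% headroom default are our own assumptions for illustration.

```python
def utilisation(usage_millicores: int, requested_millicores: int) -> float:
    """Fraction of the requested CPU that is actually being used."""
    if requested_millicores <= 0:
        raise ValueError("requested_millicores must be positive")
    return usage_millicores / requested_millicores


def suggested_request(peak_millicores: int, headroom_percent: int = 20) -> int:
    """Peak usage plus a safety margin, rounded up to a whole millicore."""
    return (peak_millicores * (100 + headroom_percent) + 99) // 100


def node_utilisation(pod_usages_millicores, node_capacity_millicores: int) -> float:
    """Share of a node's CPU capacity consumed by the pods running on it."""
    return sum(pod_usages_millicores) / node_capacity_millicores


if __name__ == "__main__":
    # A container using 150m of a 500m request is 30% utilised.
    print(utilisation(150, 500))
    # Keep some headroom above observed peak usage to absorb spikes.
    print(suggested_request(150))
    # Two pods using 150m and 350m on a node with 2000m of capacity.
    print(node_utilisation([150, 350], 2000))
```

The same arithmetic applies to memory; only the unit changes.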

Here is the complete list of questions.

Kubernetes in Production: Security Best Practices

Security is central to modern application and infrastructure design. Kubernetes is a rapidly evolving platform, and security can easily become an afterthought in the ongoing race to keep up with new releases.

Kubernetes, containerisation and the micro-services trend introduce new security challenges. The fact that Kubernetes pods can be easily spun up across all infrastructure classes leads by default to a lot more internal traffic between pods. This also means that the attack surface for Kubernetes is usually larger. Additionally, the highly dynamic and ephemeral environment of Kubernetes does not lend itself well to legacy security tools.

Below we review some security best practices for this highly dynamic and ephemeral Kubernetes environment.

A starter best practice for Kubernetes security is to always upgrade to the latest Kubernetes version. The latest version is most likely to include critical bug fixes and new security features.

Following user access best practices and enabling native role-based access control (RBAC) for Kubernetes clusters is also important. Kubernetes namespaces can serve a security purpose too, by dividing clusters into separate virtual partitions and containing the fallout from attacks. Other security best practices include enabling admission controllers such as PodSecurityPolicy (removed in Kubernetes 1.25 in favour of Pod Security Admission) and AlwaysPullImages, implementing authentication for the kubelet, enabling encryption of data at rest, and configuring Pod, container and volume security policies.
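To make the RBAC point more concrete, here is a hedged sketch of a namespaced Role that only allows reading Pods, together with a RoleBinding that grants it to a single user. The namespace, user and object names are hypothetical; the manifests are built as Python dictionaries and printed as JSON so they can be inspected or applied with kubectl.

```python
import json

NAMESPACE = "team-a"  # hypothetical namespace
USER = "jane"         # hypothetical user name
RBAC_GROUP = "rbac.authorization.k8s.io"

# Read-only access to Pods in a single namespace.
role = {
    "apiVersion": f"{RBAC_GROUP}/v1",
    "kind": "Role",
    "metadata": {"name": "pod-reader", "namespace": NAMESPACE},
    "rules": [
        {"apiGroups": [""], "resources": ["pods"], "verbs": ["get", "list", "watch"]}
    ],
}

# Grant the Role above to one user, in the same namespace.
role_binding = {
    "apiVersion": f"{RBAC_GROUP}/v1",
    "kind": "RoleBinding",
    "metadata": {"name": "read-pods", "namespace": NAMESPACE},
    "subjects": [{"kind": "User", "name": USER, "apiGroup": RBAC_GROUP}],
    "roleRef": {
        "kind": "Role",
        "name": role["metadata"]["name"],
        "apiGroup": RBAC_GROUP,
    },
}

if __name__ == "__main__":
    bundle = {"apiVersion": "v1", "kind": "List", "items": [role, role_binding]}
    print(json.dumps(bundle, indent=2))
```

Because both objects are namespaced, the permissions stop at the namespace boundary, which is how namespaces help contain the fallout from a compromised account.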

Download the complete list of Security best practices for Kubernetes in Production.

Kubernetes in Production: Scalability Best Practices

Kubernetes offers a number of tools for scalability. These include the Cluster Autoscaler, the Horizontal Pod Autoscaler and the Vertical Pod Autoscaler. The Horizontal Pod Autoscaler automatically scales the number of pods in a Deployment or ReplicaSet based on CPU utilisation or custom metrics.

Similarly, the Vertical Pod Autoscaler configures resource requests and limits for containers based on utilisation metrics and the Cluster Autoscaler scales cluster size up and down by adding or removing nodes.
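The scaling decision made by the Horizontal Pod Autoscaler follows a simple ratio rule documented by Kubernetes: the desired replica count is the current count multiplied by the ratio of the current metric value to the target value, rounded up. The sketch below is a simplified version of that rule. It keeps the default 10% tolerance but ignores minimum and maximum replica bounds, pods that are not ready, and stabilisation windows.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     tolerance: float = 0.1) -> int:
    """Simplified HPA rule: scale by the ratio of current to target metric."""
    ratio = current_metric / target_metric
    # Skip scaling when the ratio is close enough to 1.0.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


if __name__ == "__main__":
    # Average CPU is 200m against a 100m target: the replica count doubles.
    print(desired_replicas(4, 200, 100))
    # Average CPU is 50m against a 100m target: the replica count halves.
    print(desired_replicas(4, 50, 100))
    # Within tolerance: no change.
    print(desired_replicas(4, 105, 100))
```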

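A best practice to ensure scalable Kubernetes environments in production is to configure and use all three of these autoscalers, taking care not to let the Horizontal and Vertical Pod Autoscalers act on the same CPU or memory metric for the same workload.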

Download the complete best practices checklist for Kubernetes in Production.

Let’s now move on to section two where we go through the process of setting up a monitoring pipeline for Kubernetes. We will specifically be looking at monitoring resource metrics for Kubernetes.

Kubernetes in Production: Monitoring Resource Metrics

In the context of Kubernetes resource monitoring, it is important to consider the many abstractions Kubernetes introduces. These range from ones on the infrastructure level like namespaces, clusters and nodes to ones on the application layer like pods and containers.

Additionally, we need to take into consideration the resource management model of Kubernetes, which allows us to manage resources using resource requests and limits. Add to this the Utilisation, Saturation and Errors (USE) method developed by Brendan Gregg and we have an idea of the resource metrics that we need to be tracking in production.

We can start off with monitoring resource usage metrics for containers, pods, nodes and namespaces. Next, we add utilisation and saturation metrics for nodes and clusters to this list.

Here is a comprehensive list of the resource metrics we need to be tracking for Kubernetes in Production.

Once we have identified the metrics, we can start looking at the tools to monitor them in production. We have a number of options when it comes to tools, including metrics-server, Prometheus and Grafana.

Install metrics-server using:

# The project now lives under kubernetes-sigs (formerly kubernetes-incubator)
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

Once installed, we can query Pod and Node resource usage using:

kubectl top pod


kubectl top node
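The output of these commands is plain text, so it is easy to post-process. Here is a minimal sketch that parses kubectl top pod output into numbers we can sort or compare. The sample output is made up for illustration, and the parser assumes the default three-column layout for a single namespace.

```python
# Hypothetical sample of `kubectl top pod` output (not from a real cluster).
SAMPLE_OUTPUT = """\
NAME        CPU(cores)   MEMORY(bytes)
web-7d9f    12m          34Mi
db-0        250m         512Mi
"""

MEMORY_UNITS = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3}


def parse_cpu(value: str) -> int:
    """Return CPU in millicores, e.g. '250m' -> 250 and '2' -> 2000."""
    if value.endswith("m"):
        return int(value[:-1])
    return int(float(value) * 1000)


def parse_memory(value: str) -> int:
    """Return memory in bytes, e.g. '34Mi' -> 34 * 1024**2."""
    for suffix, factor in MEMORY_UNITS.items():
        if value.endswith(suffix):
            return int(value[: -len(suffix)]) * factor
    return int(value)  # no suffix: already plain bytes


def parse_top(output: str):
    """Parse the three-column table, skipping the header line."""
    rows = []
    for line in output.strip().splitlines()[1:]:
        name, cpu, memory = line.split()
        rows.append({
            "name": name,
            "cpu_millicores": parse_cpu(cpu),
            "memory_bytes": parse_memory(memory),
        })
    return rows


if __name__ == "__main__":
    pods = parse_top(SAMPLE_OUTPUT)
    busiest = max(pods, key=lambda pod: pod["cpu_millicores"])
    print(busiest["name"], busiest["cpu_millicores"])
```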

metrics-server gives us access to only a couple of the resource metrics we identified: current CPU and memory usage. To monitor a larger set of Kubernetes resource metrics we need to deploy a tool like Prometheus.

We can install Prometheus using Helm and the Prometheus operator from CoreOS. We have covered this in some detail in our comprehensive guide to monitoring Kubernetes resource metrics with Prometheus. Once installed, Prometheus gives us access to a much larger set of Kubernetes resource metrics.

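For example, we can track resource usage for Kubernetes objects like Pods and Namespaces as well as utilisation metrics for Nodes and Clusters. Following are some sample Prometheus expressions. They use the pod_name and container_name labels; from Kubernetes 1.16 onwards the same metrics expose these labels as pod and container instead.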

Kubernetes CPU usage by Pod: 

sum(rate(container_cpu_usage_seconds_total{container_name!="POD",pod_name!=""}[5m])) by (pod_name) 

Kubernetes CPU usage by Namespace:

sum(rate(container_cpu_usage_seconds_total{container_name!="POD",namespace!=""}[5m])) by (namespace)
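Both expressions combine rate() over a five-minute window with sum() grouped by a label. To show what that means, here is a simplified sketch of the arithmetic: the per-second increase of a counter between the first and last sample in the window, summed per pod. Real Prometheus also handles counter resets and extrapolates to the window boundaries, which this sketch deliberately leaves out. All sample data is made up.

```python
def simple_rate(samples):
    """Per-second increase between the first and last (timestamp, value) sample."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to compute a rate")
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    return (v_last - v_first) / (t_last - t_first)


def sum_rate_by_pod(series_by_pod):
    """Add up the container rates of each pod, like sum(...) by (pod_name)."""
    return {
        pod: sum(simple_rate(samples) for samples in container_series)
        for pod, container_series in series_by_pod.items()
    }


if __name__ == "__main__":
    # Hypothetical CPU-seconds counters sampled over a 300 second window.
    series = {
        "web": [
            [(0, 0.0), (150, 40.0), (300, 75.0)],     # container 1
            [(0, 10.0), (150, 90.0), (300, 160.0)],   # container 2
        ],
        "db": [
            [(0, 5.0), (300, 155.0)],
        ],
    }
    # Each value is the average number of CPU cores used by the pod.
    print(sum_rate_by_pod(series))
```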

Here is an extensive list of Prometheus expressions to monitor resource metrics for individual Kubernetes objects.

To further beef up the monitoring pipeline and for a more interactive view of resource metrics, we integrate Grafana. Grafana allows us to create information-rich graphs in a user-friendly visual format. Grafana is also a part of the Prometheus operator project and is installed along with Prometheus.

Once installed we have the option of defining Prometheus as a data source in Grafana and porting over the resource metrics. There are some differences in expressions used for querying metrics in Prometheus and Grafana. We outline a complete list of these expressions for Grafana here.

Once we have the expressions identified, we can create a Grafana Kubernetes dashboard which tracks resource metrics for individual Kubernetes objects.

We cover this in exhaustive detail here, where we create a Grafana dashboard with four sections, one each for Pods, Namespaces, Nodes and Clusters. Each section has individual graphs for tracking resource metrics, from usage and requests to saturation and utilisation. Following are a couple of expressions with screenshots of the graphs:

Kubernetes memory usage per Node:

sum(container_memory_usage_bytes{container_name!="POD",container_name!=""}) by (node)

[Graph: Kubernetes memory usage per Node]

Kubernetes CPU usage per Namespace:

sum(rate(container_cpu_usage_seconds_total{container_name!="POD", namespace!=""}[5m])) by (namespace)

[Graph: Kubernetes CPU usage per Namespace]

Here is a detailed walkthrough of creating a Grafana dashboard to track resource metrics for Kubernetes in Production.

We will be updating this page with best practices, tutorials and guides, so stay tuned! 
