Kubernetes in Production: Readiness Checklist and Best Practices

Deploying Kubernetes in production is no easy task. Kubernetes environments need to hold up on the rough seas of production from an availability, security, scalability, resilience, monitoring and resource management perspective. To make the task easier we have compiled a checklist of best practices that DevOps and Kubernetes administrators can go through to ensure their Kubernetes deployments are production ready.

Hasham Haider


February 20, 2019

7 minute read

Kubernetes in production. The term always reminds me of the phrase “here be dragons”: the medieval practice of illustrating unexplored areas of the map with dragons, sea-monsters, and other mythological creatures.

Deploying Kubernetes in production gives rise to some of the same feelings of fear and apprehension sailors had when setting sail into the great unknown.

But fear not, here at Replex we are committed to making Kubernetes adoption as seamless as possible. To that end, we have compiled an exhaustive checklist of best practices that DevOps teams and Kubernetes administrators can go through to ensure their Kubernetes workloads are production ready.

But what does production readiness mean for a Kubernetes cluster? Defining it is easy: Kubernetes is production ready when it is ready for live service. But there is a lot more to production readiness than definitions.


Kubernetes in Production: What is it Really?

We like to think of production-grade Kubernetes in terms of high availability, security, resource management, scalability and monitoring. Some questions to ask before launching Kubernetes in production are: Have I reviewed each of these aspects of my Kubernetes environment? Do I know how my Kubernetes environment will perform during normal operation? And how will my system behave in a worst-case scenario for any one of these aspects?

With this checklist of production readiness best practices, DevOps and Kubernetes administrators can ensure their Kubernetes environment is production ready.

In this first blog instalment we will be looking at best practices for high availability. In follow-up posts, we will branch off into best practice checklists for resource management, scalability, security and monitoring. You can also download the complete Kubernetes best practices checklist here.

The best practices will be backed up by Kubernetes recipes and tips on managing some of the concepts introduced and will include items on both the infrastructure as well as the application layer.

The checklist assumes that you have already set up your Kubernetes environment and your finger is now hovering over the big red launch button, looking for last minute guidance and best practices to ensure your Kubernetes deployment won’t blow up once you hit enter.

High Availability

Configured Liveness and Readiness Probes? 

A liveness probe is the Kubernetes equivalent of “have you tried turning it off and on again”. Liveness probes detect containers that cannot recover from a failed state on their own, and the kubelet restarts them. They are a great tool for building auto-recovery into production Kubernetes deployments. You can define liveness probes as exec (command), HTTP or TCP checks, all of which are run by the kubelet.

Readiness probes detect whether a container is temporarily unable to receive traffic. While the probe fails, the pod is taken out of the Service endpoints so that no traffic is routed to it. During deployment updates, readiness probes also hold traffic back from new pods until they are ready to receive it.
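As a rough illustration, the sketch below builds a container definition with an HTTP liveness probe and a TCP readiness probe and prints it as JSON, ready to be placed in the pod template of a Deployment. The container name, image, port and /healthz path are placeholders invented for this example, and the timing values are starting points to tune rather than recommendations.

```python
import json

# Hypothetical container: name, image, port and health path are placeholders.
container = {
    "name": "web",
    "image": "example.com/web:1.0",
    "ports": [{"containerPort": 8080}],
    # Restart the container once /healthz has failed three checks in a row.
    "livenessProbe": {
        "httpGet": {"path": "/healthz", "port": 8080},
        "initialDelaySeconds": 10,
        "periodSeconds": 5,
        "failureThreshold": 3,
    },
    # Keep the pod out of Service endpoints until the port accepts connections.
    "readinessProbe": {
        "tcpSocket": {"port": 8080},
        "initialDelaySeconds": 5,
        "periodSeconds": 5,
    },
}

print(json.dumps(container, indent=2))
```

Using separate checks for the two probes is a deliberate choice here: a container that is alive but still warming up should be kept out of rotation, not restarted.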

Provisioned at Least 3 Master Nodes?

Replicating the control plane across 3 nodes is the minimum required configuration for a highly available Kubernetes cluster. etcd needs a majority of its members (a quorum) to be available in order to continue functioning. With 3 master nodes, the cluster can survive the failure of 1 master node, since the remaining 2 still form a majority.

Here is a table outlining the fault tolerance of different cluster sizes.

Cluster size   Majority   Failure tolerance
1              1          0
2              2          0
3              2          1
4              3          1
5              3          2
6              4          2
7              4          3
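The rule behind these numbers is simple majority arithmetic: a cluster of n members needs floor(n/2) + 1 of them for quorum and can lose the rest. A minimal sketch of that calculation, with function names of our own choosing:

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """How many members can fail while a majority is still reachable."""
    return members - quorum(members)


# Print one row per cluster size so odd and even sizes can be compared.
print(f"{'Cluster size':>12}  {'Majority':>8}  {'Failure tolerance':>17}")
for size in range(1, 10):
    print(f"{size:>12}  {quorum(size):>8}  {fault_tolerance(size):>17}")
```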

Replicated Master Nodes in Odd Numbers?

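As is apparent from this table, master nodes should always be replicated in odd numbers. An odd-numbered cluster has the same failure tolerance as the next-highest even-numbered one, so the extra node adds cost without adding resilience: clusters of 3 and 4 master nodes both tolerate only a single failure.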

Isolated etcd Replicas?

The etcd master component is responsible for storing and replicating cluster state. As such it has high resource requirements. Therefore, a best practice is to isolate the etcd replicas by placing them on dedicated nodes. This de-couples the control plane components and the etcd members and ensures sufficient resource availability for etcd members making the cluster more robust and reliable.

It is recommended to have at least a 5-member etcd cluster in production.

Have a Plan for Regular etcd Backups?

Since etcd stores cluster state, it is always a best practice to back up etcd data regularly. It is also a good idea to store the backups on a separate host. etcd clusters can be backed up by taking a snapshot with the etcdctl snapshot save command or by copying the member/snap/db file from an etcd data directory.

When using public cloud provider storage volumes, it is relatively easy to create etcd backups by taking a snapshot of the storage volume.
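For the etcdctl snapshot save route, a small wrapper that can be run from a scheduler such as cron might look like the sketch below. The function names are ours, and the endpoint, certificate paths and backup directory are values you would supply for your own cluster. Nothing here copies the file to a separate host, which you would still need to arrange.

```python
import os
import subprocess
from datetime import datetime, timezone
from typing import List


def snapshot_command(target: str, endpoint: str, cacert: str, cert: str, key: str) -> List[str]:
    """Build the etcdctl invocation that writes a snapshot to `target`."""
    return [
        "etcdctl",
        f"--endpoints={endpoint}",
        f"--cacert={cacert}",
        f"--cert={cert}",
        f"--key={key}",
        "snapshot", "save", target,
    ]


def take_snapshot(backup_dir: str, endpoint: str, cacert: str, cert: str, key: str) -> str:
    """Run the snapshot and return the path of the file that was written."""
    # A UTC timestamp in the file name keeps successive backups apart.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = os.path.join(backup_dir, f"etcd-{stamp}.db")
    # ETCDCTL_API=3 selects the v3 API on etcdctl builds that default to v2.
    env = {**os.environ, "ETCDCTL_API": "3"}
    # check=True raises CalledProcessError if etcdctl exits with an error.
    subprocess.run(snapshot_command(target, endpoint, cacert, cert, key), env=env, check=True)
    return target
```

Building the command separately from running it makes the wrapper easy to test without a live etcd member.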

Distributed Master Nodes across Zones?

Distributing master nodes across zones is also a high availability best practice. This ensures that the control plane survives the outage of an entire availability zone.

Using Kops, master nodes can be easily distributed across zones using the --master-zones flag.

Distributed Worker Nodes across Zones?

Worker nodes should also be distributed across availability zones. Worker nodes can be distributed across zones by using the --zones flag in Kops.
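To make the two flags concrete, here is a hedged sketch that assembles a kops create cluster command spreading masters and workers over the same set of zones. The cluster name, state store bucket and zone names are made up for the example, and the helper function is ours, not part of Kops.

```python
from typing import List


def kops_create_cluster(name: str, state: str, zones: List[str], node_count: int) -> List[str]:
    """Build a kops command that places masters and workers in `zones`."""
    # Kops starts one master per master zone, so keep the zone count odd.
    if len(zones) % 2 == 0:
        raise ValueError("use an odd number of master zones")
    zone_list = ",".join(zones)
    return [
        "kops", "create", "cluster",
        f"--name={name}",
        f"--state={state}",
        f"--master-zones={zone_list}",
        f"--zones={zone_list}",
        f"--node-count={node_count}",
    ]


# Hypothetical cluster name, state bucket and zones.
command = kops_create_cluster(
    name="prod.example.com",
    state="s3://example-kops-state",
    zones=["eu-central-1a", "eu-central-1b", "eu-central-1c"],
    node_count=6,
)
print(" ".join(command))
```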

Configured Autoscaling for Both Master and Worker Nodes?

When using the cloud, a best practice is to place both master and worker nodes in autoscaling groups. Autoscaling groups will automatically bring up a replacement node when one is terminated. Kops places both master and worker nodes into autoscaling groups by default.

Baked-in HA Load Balancing?

Once multiple master replica nodes have been deployed, the next obvious step is to load balance traffic to and from those replicas. You can do this by creating an L4 load balancer in front of all apiserver instances and updating the DNS name to point at it, or by using round-robin DNS to access all apiservers directly. Check this document for more information.

Configured Active-Passive Setup for Scheduler and Controller Manager?

Unlike the other control plane components, the scheduler and controller manager actively modify cluster state, so only one instance of each should be making decisions at any given time. Once both components have been replicated across zones, they should therefore be configured in an active-passive setup, with the remaining replicas standing by to take over.

This can be done by passing the --leader-elect flag to both kube-scheduler and kube-controller-manager.
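The idea behind leader election can be shown with a toy model: replicas compete for a lease, the holder keeps renewing it, and a standby takes over only once the lease has gone stale. The sketch below is a simplified illustration of that pattern and not the actual Kubernetes implementation. The class, the replica names and the 15 second duration are all invented for the example.

```python
class Lease:
    """A lock record that names the current leader and when it last renewed."""

    def __init__(self, duration: float):
        self.duration = duration
        self.holder = None
        self.renewed_at = 0.0

    def try_acquire(self, candidate: str, now: float) -> bool:
        """Renew if `candidate` already leads; take over only if the lease expired."""
        expired = self.holder is None or now - self.renewed_at > self.duration
        if self.holder == candidate or expired:
            self.holder = candidate
            self.renewed_at = now
            return True
        return False


lease = Lease(duration=15.0)
print(lease.try_acquire("scheduler-a", now=0.0))   # lease is free, a becomes active
print(lease.try_acquire("scheduler-b", now=5.0))   # lease is fresh, b stays passive
print(lease.try_acquire("scheduler-a", now=10.0))  # a renews its own lease
print(lease.try_acquire("scheduler-b", now=30.0))  # a stopped renewing, b takes over
```

Time is passed in explicitly so the behaviour is deterministic and easy to follow.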

Configured the Correct Number of Pod Replicas for High Availability?

To ensure highly available Kubernetes workloads, pods should also be replicated using Kubernetes controllers like ReplicaSets, Deployments and StatefulSets.

Both Deployments and StatefulSets are central to the concept of high availability and will ensure that the desired number of pods is always maintained. The number of replicas is usually dictated by application requirements.

Kubernetes recommends using Deployments over ReplicaSets for pod replication since they are declarative and allow you to roll back to previous versions easily. However, if your use case requires custom update orchestration or does not require updates at all, you can still use ReplicaSets.
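As a minimal sketch of what this looks like in practice, the helper below generates a Deployment manifest with a configurable replica count and prints it as JSON, which kubectl accepts as well as YAML. The helper, the application name, the image and the replica count of 3 are placeholders for the example.

```python
import json


def deployment(name: str, image: str, replicas: int) -> dict:
    """Return an apps/v1 Deployment that keeps `replicas` pods running."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            # The selector must match the labels on the pod template.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": name, "image": image}]},
            },
        },
    }


# Hypothetical application name and image.
manifest = deployment("web", "example.com/web:1.0", replicas=3)
print(json.dumps(manifest, indent=2))
```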

Spinning up any Naked Pods?

Are all your pods part of a ReplicaSet or Deployment? Naked pods are not rescheduled in case of node failure or shutdown. Therefore, it is best practice to always spin up pods as part of a ReplicaSet or Deployment.
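One way to audit an existing cluster is to look for pods that have no owner. The sketch below assumes its input is the parsed output of kubectl get pods --all-namespaces -o json and flags every pod whose metadata has no ownerReferences entry. The function name and the sample data are ours.

```python
from typing import List


def naked_pods(pod_list: dict) -> List[str]:
    """Return namespace/name for every pod that has no owning controller."""
    found = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        # Pods created by a controller carry at least one ownerReferences entry.
        if not meta.get("ownerReferences"):
            found.append(f"{meta.get('namespace', 'default')}/{meta.get('name')}")
    return found


# Made-up sample in the shape of a kubectl pod list.
sample = {
    "items": [
        {"metadata": {"name": "web-7d4b9", "namespace": "prod",
                      "ownerReferences": [{"kind": "ReplicaSet", "name": "web-7d4b"}]}},
        {"metadata": {"name": "debug-shell", "namespace": "prod"}},
    ]
}
print(naked_pods(sample))
```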

Set up Federation for Multiple Clusters?

If you are provisioning multiple clusters for low latency, availability and scalability, setting up Kubernetes federation is a best practice. Federation will allow you to keep resources across clusters in sync and auto-configure DNS servers and load balancers.

Federating clusters involves first setting up the federation control plane and then creating federation API resources.

Configured Heartbeat and Election Timeout Intervals for etcd Members?

When configuring etcd clusters, it is important to correctly specify both the heartbeat interval and the election timeout. The heartbeat interval is the frequency with which the etcd leader notifies followers that it is still the leader. The election timeout is how long a follower will wait without receiving a heartbeat before attempting to become leader itself.

The heartbeat interval is recommended to be roughly the round trip time between the members. The election timeout is recommended to be at least 10 times the round trip time between members.
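Applied to a measured round trip time, the rule of thumb above works out as in the following sketch. It emits values for etcd's --heartbeat-interval and --election-timeout flags, both of which take milliseconds. The function name and the 12.4 ms sample measurement are invented for the example, and rounding up to whole milliseconds is our own choice.

```python
import math
from typing import List


def etcd_timing_flags(rtt_ms: float, election_factor: int = 10) -> List[str]:
    """Derive etcd timing flags from the round trip time between members."""
    if election_factor < 10:
        raise ValueError("election timeout should be at least 10x the round trip time")
    # Round up so the heartbeat interval is never shorter than the measured RTT.
    heartbeat = math.ceil(rtt_ms)
    election = heartbeat * election_factor
    return [f"--heartbeat-interval={heartbeat}", f"--election-timeout={election}"]


# Hypothetical measurement between members in different zones.
print(etcd_timing_flags(rtt_ms=12.4))
```

Measure the round trip time between the members that are furthest apart, since the slowest link is the one that triggers spurious elections.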

Set up Ingress?

Ingress routes HTTP and HTTPS traffic from outside the cluster to services inside it. Ingress can also be used for load balancing, terminating SSL and giving services externally reachable URLs.

In order for ingress to work, your cluster needs an ingress controller. Kubernetes officially supports the GCE and nginx ingress controllers as of now. Here is a list of other ingress controllers you might want to check out.

You can also create an external cloud load balancer in place of the ingress resource, by including type: LoadBalancer in the Service configuration file.
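For the type: LoadBalancer alternative, a Service definition could look roughly like the sketch below, again generated as JSON. The service name, the app: web selector and the port numbers are placeholders, and what the cloud provider actually provisions depends on the platform.

```python
import json

# Hypothetical Service exposing pods labelled app=web through a cloud load balancer.
service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "web"},
    "spec": {
        # Ask the cloud provider to provision an external load balancer.
        "type": "LoadBalancer",
        "selector": {"app": "web"},
        # Accept traffic on port 80 and forward it to port 8080 on the pods.
        "ports": [{"protocol": "TCP", "port": 80, "targetPort": 8080}],
    },
}

print(json.dumps(service, indent=2))
```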

To recap, here are the recommended Kubernetes best practices for high availability:

  • Configure Liveness and Readiness Probes
  • Provision at Least 3 Master Nodes
  • Replicate Master Nodes in Odd Numbers
  • Isolate etcd Replicas
  • Develop a Plan for Regular etcd Backups
  • Distribute Master Nodes across Zones
  • Distribute Worker Nodes across Zones
  • Configure Autoscaling for Both Master and Worker Nodes
  • Bake-in HA Load Balancing
  • Configure Active-Passive Setup for Scheduler and Controller Manager
  • Configure the Correct Number of Pod Replicas for High Availability
  • Avoid Spinning up Naked Pods
  • Set up Federation for Multiple Clusters
  • Configure Heartbeat and Election Timeout Intervals for etcd Members
  • Set up Ingress
