Kubernetes in production. The term always reminds me of the phrase “here be dragons”: the medieval practice of illustrating unexplored areas of the map with dragons, sea-monsters, and other mythological creatures.
Deploying Kubernetes in production gives rise to some of the same feelings of fear and apprehension sailors had when setting sail into the great unknown.
But fear not, here at Replex we are committed to making Kubernetes adoption as seamless as possible. To that end, we have compiled an exhaustive checklist of best practices that DevOps teams and Kubernetes administrators can go through to ensure their Kubernetes workloads are production ready.
But what does production readiness mean for a Kubernetes cluster? Defining it is easy; Kubernetes is production ready when it is ready for live service. But there is a lot more to production readiness than definitions.
We like to think of production-grade Kubernetes in terms of high availability, security, resource management, scalability, and monitoring. Some questions to ask when launching Kubernetes in production are: Have I run through all of these aspects of my Kubernetes environment? Do I know how my Kubernetes environment will perform during normal operation? And how will my system behave in a worst-case scenario for any one of these aspects?
With this checklist of production readiness best practices, DevOps and Kubernetes administrators can ensure their Kubernetes environment is production ready.
In this first blog instalment we will look at best practices for high availability. In follow-up posts, we will branch off into best practice checklists for resource management, scalability, security and monitoring. You can also download the complete Kubernetes best practices checklist here.
The best practices will be backed up by Kubernetes recipes and tips on managing some of the concepts introduced and will include items on both the infrastructure as well as the application layer.
The checklist assumes that you have already set up your Kubernetes environment and that, with your finger hovering over the big red launch button, you are looking for last-minute guidance and best practices to ensure your Kubernetes deployment won’t blow up once you hit enter.
Liveness probes are the Kubernetes equivalent of “have you tried turning it off and on again”. They detect containers that cannot recover from a failed state and restart them, which makes them a great tool for building auto-recovery into production Kubernetes deployments. The kubelet runs the probes, and you can base them on command (exec), HTTP or TCP checks.
Readiness probes detect whether a container is temporarily unable to receive traffic and will mitigate these situations by stopping traffic flow to it. Readiness probes will also detect whether new pods are ready to receive traffic, before allowing traffic flow, during deployment updates.
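As a rough illustration of the two probe types, here is a minimal Python sketch that builds a Pod manifest with an HTTP liveness probe and a TCP readiness probe and prints it as JSON, which kubectl accepts alongside YAML. The pod name, image, port, the `/healthz` path and all timing values are assumptions chosen for the example, not recommendations.

```python
import json


def pod_with_probes(name: str, image: str, port: int) -> dict:
    """Build a Pod manifest with a liveness and a readiness probe."""
    container = {
        "name": name,
        "image": image,
        "ports": [{"containerPort": port}],
        # Restart the container if the (hypothetical) /healthz endpoint
        # fails three consecutive checks.
        "livenessProbe": {
            "httpGet": {"path": "/healthz", "port": port},
            "initialDelaySeconds": 10,
            "periodSeconds": 10,
            "failureThreshold": 3,
        },
        # Only route traffic to the pod once the port accepts connections.
        "readinessProbe": {
            "tcpSocket": {"port": port},
            "initialDelaySeconds": 5,
            "periodSeconds": 5,
        },
    }
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "labels": {"app": name}},
        "spec": {"containers": [container]},
    }


if __name__ == "__main__":
    # Image name is a placeholder for illustration.
    print(json.dumps(pod_with_probes("web", "example.com/web:1.0", 8080), indent=2))
```

If the script is saved as `probes.py` (a hypothetical file name), the output could be piped into `kubectl apply -f -`.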
Having the control plane replicated across 3 nodes is the minimum required configuration for a highly available Kubernetes cluster. etcd requires a majority of its members to be available in order to form a quorum and continue functioning. With 3 master nodes, the cluster can overcome the failure of 1 master node, since it still has 2 to form a majority.
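The fault tolerance of different cluster sizes follows from the majority rule: a cluster of 1 or 2 members tolerates no failures, 3 or 4 members tolerate 1 failure, 5 or 6 members tolerate 2 failures, and 7 or 8 members tolerate 3 failures.
As is apparent from these numbers, master nodes should always be replicated in odd numbers. An odd-numbered cluster has the same fault tolerance as the next highest even-numbered cluster, so the extra member adds cost without adding resilience.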
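The majority rule behind these numbers is simple enough to compute directly. The short sketch below (function names are ours, chosen for illustration) prints the quorum size and the number of member failures a cluster of each size can absorb.

```python
def quorum(members: int) -> int:
    """Smallest number of members that forms a strict majority."""
    return members // 2 + 1


def failure_tolerance(members: int) -> int:
    """Members that can fail while the cluster still has a quorum."""
    return members - quorum(members)


if __name__ == "__main__":
    print(f"{'size':>4} {'majority':>8} {'tolerance':>9}")
    for size in range(1, 10):
        print(f"{size:>4} {quorum(size):>8} {failure_tolerance(size):>9}")
```

Running it shows that each even size tolerates exactly as many failures as the odd size just below it.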
etcd is the control plane component responsible for storing and replicating cluster state. As such, it has high resource requirements. A best practice is therefore to isolate the etcd replicas by placing them on dedicated nodes. This decouples etcd from the other control plane components and ensures sufficient resources for the etcd members, making the cluster more robust and reliable.
It is recommended to have at least a 5-member etcd cluster in production.
Since etcd stores cluster state, it is always a best practice to back up etcd data regularly. It is also a good idea to save etcd backup data on a separate host. etcd clusters can be backed up by taking a snapshot with the etcdctl snapshot save command or by copying the member/snap/db file from an etcd data directory.
When using public cloud provider storage volumes, it is relatively easy to create etcd backups by taking a snapshot of the storage volume.
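A scheduled backup job could wrap the etcdctl snapshot save command along the following lines. This is a sketch under assumptions: the endpoint, the certificate paths (shown here in the layout kubeadm typically uses) and the output file are placeholders to replace with your own values, and the script only prints the command unless you call `take_snapshot` yourself.

```python
import os
import shlex
import subprocess
from typing import List


def snapshot_command(endpoint: str, cacert: str, cert: str, key: str,
                     out_file: str) -> List[str]:
    """Build the etcdctl invocation that saves a snapshot to out_file."""
    return [
        "etcdctl",
        f"--endpoints={endpoint}",
        f"--cacert={cacert}",
        f"--cert={cert}",
        f"--key={key}",
        "snapshot", "save", out_file,
    ]


def take_snapshot(cmd: List[str]) -> None:
    """Run the snapshot command with the v3 API selected explicitly."""
    env = dict(os.environ, ETCDCTL_API="3")
    subprocess.run(cmd, env=env, check=True)


if __name__ == "__main__":
    # All values below are assumptions for illustration.
    cmd = snapshot_command(
        endpoint="https://127.0.0.1:2379",
        cacert="/etc/kubernetes/pki/etcd/ca.crt",
        cert="/etc/kubernetes/pki/etcd/server.crt",
        key="/etc/kubernetes/pki/etcd/server.key",
        out_file="/var/backups/etcd-snapshot.db",
    )
    print(shlex.join(cmd))
```

Copying the resulting file to a separate host afterwards keeps the backup usable even if the etcd node itself is lost.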
Distributing master nodes across zones is also a high availability best practice. This ensures that the control plane can survive the outage of an entire availability zone.
Using Kops, master nodes can be easily distributed across zones using the --master-zones flag.
Worker nodes should also be distributed across availability zones. Worker nodes can be distributed across zones by using the --zones flag in Kops.
When using the cloud, a best practice is to place both master and worker nodes in autoscaling groups. Autoscaling groups will automatically bring up a replacement node when one is terminated. Kops places both master and worker nodes into autoscaling groups by default.
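To make the Kops flags concrete, the sketch below assembles a `kops create cluster` command line that spreads masters and workers across three zones. The cluster name, state store bucket, zone names and node count are hypothetical, and newer Kops releases may prefer a differently named flag for master zones, so check `kops create cluster --help` for your version.

```python
import shlex
from typing import List


def kops_create_cluster_command(name: str, state: str, master_zones: List[str],
                                worker_zones: List[str], node_count: int) -> List[str]:
    """Build a kops command that distributes masters and workers across zones."""
    if len(master_zones) % 2 == 0:
        # An even number of masters adds no fault tolerance over the odd
        # number just below it, so reject it early.
        raise ValueError("use an odd number of master zones")
    return [
        "kops", "create", "cluster",
        f"--name={name}",
        f"--state={state}",
        f"--master-zones={','.join(master_zones)}",
        f"--zones={','.join(worker_zones)}",
        f"--node-count={node_count}",
    ]


if __name__ == "__main__":
    zones = ["us-east-1a", "us-east-1b", "us-east-1c"]  # example zones
    cmd = kops_create_cluster_command(
        "ha.example.com", "s3://example-kops-state", zones, zones, 6)
    print(shlex.join(cmd))
```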
Once multiple master replica nodes have been deployed, the next obvious step is to load balance traffic to and from those replicas. You can do this by creating an L4 load balancer in front of all apiserver instances and updating the DNS name appropriately, or by using round-robin DNS to access all apiservers directly. Check this document for more information.
Unlike the apiserver, the scheduler and controller manager actively modify cluster state, so only one instance of each should be active at any given time. Once both components have been replicated across zones, they should therefore be configured in an active-passive setup.
This can be done by passing the --leader-elect flag to both kube-scheduler and kube-controller-manager.
To ensure highly available Kubernetes workloads, pods should also be replicated using Kubernetes controllers like ReplicaSets, Deployments and StatefulSets.
Both Deployments and StatefulSets are central to the concept of high availability and will ensure that the desired number of pods is always maintained. The number of replicas is usually dictated by application requirements.
Kubernetes recommends using Deployments over ReplicaSets for pod replication, since they are declarative and allow you to roll back to previous versions easily. However, if your use case requires custom update orchestration or does not require updates at all, you can still use ReplicaSets.
Are all your pods part of a ReplicaSet or Deployment? Naked pods are not rescheduled in case of node failure or shutdown. Therefore, it is best practice to always spin up pods as part of a ReplicaSet or Deployment.
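For reference, a Deployment that keeps a fixed number of pod replicas running might look like the manifest produced by this sketch. The application name, image and replica count are illustrative assumptions; the script prints JSON, which kubectl accepts just like YAML.

```python
import json


def deployment(name: str, image: str, replicas: int) -> dict:
    """Build an apps/v1 Deployment that maintains `replicas` pods."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            # The selector must match the labels on the pod template.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": name, "image": image}]},
            },
        },
    }


if __name__ == "__main__":
    # Placeholder image; three replicas chosen only as an example.
    print(json.dumps(deployment("web", "example.com/web:1.0", 3), indent=2))
```

Because the pods are created from the Deployment's template rather than directly, a pod lost to a node failure is replaced on another node.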
If you are provisioning multiple clusters for low latency, availability and scalability, setting up Kubernetes federation is a best practice. Federation will allow you to keep resources across clusters in sync and auto-configure DNS servers and load balancers.
When configuring etcd clusters, it is important to correctly specify both the heartbeat interval and the election timeout. The heartbeat interval is the frequency with which the etcd leader notifies followers that it is still the leader. The election timeout is the time period a follower will wait without receiving a heartbeat before attempting to become leader itself.
The heartbeat interval is recommended to be roughly the round trip time between the members. The election timeout is recommended to be at least 10 times the round trip time between members, so that normal network variance does not trigger needless leader elections.
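The rule of thumb above translates into etcd's `--heartbeat-interval` and `--election-timeout` flags, both of which take milliseconds. The sketch below derives the two values from a measured round trip time; the helper name and the choice to round up to whole milliseconds are our assumptions.

```python
import math
from typing import List


def etcd_timing_flags(rtt_ms: float) -> List[str]:
    """Derive etcd timing flags from the round trip time between members."""
    # Heartbeat interval: about one round trip, rounded up to a whole ms.
    heartbeat = max(1, math.ceil(rtt_ms))
    # Election timeout: ten times the heartbeat, which is at least 10x RTT.
    election = 10 * heartbeat
    return [
        f"--heartbeat-interval={heartbeat}",
        f"--election-timeout={election}",
    ]


if __name__ == "__main__":
    # 12.3 ms is an example measurement, e.g. from ping between members.
    print(" ".join(etcd_timing_flags(12.3)))
```

All members of a cluster should use the same values, so it makes sense to compute them from the slowest link between members.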
Ingress allows HTTP and HTTPS traffic from the outside internet to services inside the cluster. Ingress can also be used for load balancing, terminating SSL and to give services externally-reachable URLs.
In order for ingress to work, your cluster needs an ingress controller. Kubernetes officially supports the GCE and nginx controllers as of now. Here is a list of other ingress controllers you might want to check out.
You can also create an external cloud load balancer in place of the ingress resource, by including type: LoadBalancer in the Service configuration file.
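The two options can be compared side by side with a small sketch that builds both a Service of type LoadBalancer and an Ingress routing to a Service. The service name, host name and ports are hypothetical; the output is a JSON List object that kubectl can apply.

```python
import json


def load_balancer_service(name: str, port: int, target_port: int) -> dict:
    """Service that asks the cloud provider for an external load balancer."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            "type": "LoadBalancer",
            "selector": {"app": name},
            "ports": [{"port": port, "targetPort": target_port}],
        },
    }


def ingress(name: str, host: str, service: str, port: int) -> dict:
    """Ingress that routes HTTP traffic for `host` to a Service."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {"name": name},
        "spec": {
            "rules": [{
                "host": host,
                "http": {"paths": [{
                    "path": "/",
                    "pathType": "Prefix",
                    "backend": {"service": {"name": service,
                                            "port": {"number": port}}},
                }]},
            }],
        },
    }


if __name__ == "__main__":
    items = [
        load_balancer_service("web", 80, 8080),
        ingress("web", "web.example.com", "web", 80),
    ]
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": items}, indent=2))
```

In practice you would usually pick one of the two for a given service rather than applying both.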
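To recap, the recommended Kubernetes best practices for high availability are: configure liveness and readiness probes; replicate the control plane across at least 3 master nodes, always in odd numbers; isolate etcd replicas on dedicated nodes and run at least 5 members in production; back up etcd data regularly; distribute master and worker nodes across availability zones and place them in autoscaling groups; load balance traffic to the apiserver replicas; run the scheduler and controller manager in an active-passive setup; replicate pods with Deployments, ReplicaSets or StatefulSets; set up federation when running multiple clusters; tune the etcd heartbeat interval and election timeout; and expose services through ingress or an external load balancer.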