We recently wrote a series of articles about Kubernetes best practices in production. The series outlines Kubernetes best practices from a resource management, disaster recovery, availability, security, scalability, monitoring and governance perspective. It digs into the internals of Kubernetes and is aimed towards DevOps teams in the trenches, getting their hands dirty with Kubernetes on a daily basis.
With this article, we intend to provide a higher-level view of Kubernetes best practices in production. The article distils the main learnings from the earlier series and is targeted at CIOs and CTOs. It covers best practices using some of the same attributes of production workloads from the previous series, including monitoring, availability, security and governance.
We will also outline best practices for CIOs and CTOs to help align organisational culture with the new realities of distributed DevOps and SRE teams, the increasing skill overlap between traditional devs and ops and new CI/CD paradigms for software release cycles. So let’s jump right into it.
The cloud-native toolset has changed the way software is developed, deployed and managed. This new toolset has necessitated a shift in the way both the tools themselves and the applications they support are monitored.
The same is true of Kubernetes, which introduces a number of new abstractions at both the hardware and the application layer. Any monitoring pipeline for Kubernetes needs to take both these new abstractions and its resource management model into account.
This means that in addition to monitoring historically relevant infrastructure metrics like CPU and RAM utilisation for cloud VMs and physical machines, logical abstractions like pods, services and replica sets also need to be considered.
More importantly, however, Kubernetes monitoring needs to pivot to a new observability paradigm. Traditionally organisations have relied on black box monitoring methods to monitor infrastructure and applications. Black box monitoring observes only the external behaviour of a system.
In the cloud-native age of containers, orchestration, and microservices, monitoring needs to move beyond black box monitoring. Black box monitoring can still serve as the baseline for a monitoring strategy but it needs to be complemented by newer white box monitoring methods more suited to the distributed, ephemeral nature of containers and Kubernetes.
Observability encompasses both traditional black box monitoring methods and newer white box monitoring paradigms built on logs, traces and metrics. Observability pipelines decouple data collection from data ingestion by introducing a buffer.
The pipeline serves as the central repository of traces, metrics, logs and events, which are then forwarded to the appropriate service by a data router. This removes the need to run an agent for each destination on every host and reduces the number of integrations that need to be maintained. It also allows enterprises to avoid vendor lock-in and quickly test new SaaS-based monitoring services.
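The decoupling described above can be illustrated with a minimal sketch. It is a toy model rather than any particular product: the `ObservabilityPipeline` class, its method names and the in-memory lists standing in for monitoring backends are all hypothetical, and a real pipeline would use a durable queue rather than an in-process buffer.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

Event = Dict[str, str]


class ObservabilityPipeline:
    """Toy pipeline: collectors write to a buffer, a router forwards by signal type."""

    def __init__(self) -> None:
        self.buffer: Deque[Event] = deque()
        self.routes: Dict[str, List[Callable[[Event], None]]] = {}

    def add_route(self, kind: str, sink: Callable[[Event], None]) -> None:
        # Register a destination for one signal type ("log", "metric", "trace").
        self.routes.setdefault(kind, []).append(sink)

    def collect(self, event: Event) -> None:
        # Collection only appends to the buffer; it knows nothing about destinations.
        self.buffer.append(event)

    def flush(self) -> int:
        # Drain the buffer and forward each event to every sink registered for its type.
        forwarded = 0
        while self.buffer:
            event = self.buffer.popleft()
            for sink in self.routes.get(event["kind"], []):
                sink(event)
                forwarded += 1
        return forwarded


# Hypothetical backends, represented here as plain lists.
log_store: List[Event] = []
metric_store: List[Event] = []

pipeline = ObservabilityPipeline()
pipeline.add_route("log", log_store.append)
pipeline.add_route("metric", metric_store.append)

pipeline.collect({"kind": "log", "body": "pod restarted"})
pipeline.collect({"kind": "metric", "body": "cpu=0.42"})

delivered = pipeline.flush()
print(delivered, len(log_store), len(metric_store))
```

Because collectors only ever talk to the buffer, trialling a new monitoring service in this sketch is a single extra `add_route` call rather than a new agent on every host.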
Observability aims to understand the internals of a system and how it works, so that issues in production can be debugged and resolved quickly. Since it integrates logs, traces and metrics into traditional monitoring pipelines, it covers much more ground and requires a lot more effort to deploy.
A best practice, therefore, is for CIOs and CTOs to gradually build towards a full observability pipeline for their cloud-native environments by integrating elements of white box monitoring over time.
The adoption of cloud-native technologies has also resulted in much more overlap between traditional dev and ops teams. Observability pipelines allow organisations to better integrate these teams by helping build a culture based on facts and feedback.
High availability and disaster recovery are crucial elements of any enterprise application. Orchestration engines like Kubernetes introduce additional layers which have to be considered when designing highly available architectures.
Highly available Kubernetes environments can be seen in terms of distinct layers or levels. The bottom-most layer is the infrastructure layer, which can refer to any number of public cloud providers or physical infrastructure in a data centre. Next is the orchestration layer, which includes both hardware and software abstractions like nodes, clusters, containers and pods as well as other application components.
Public cloud providers offer a number of high availability mechanisms for compute, storage and networking that should serve as a baseline for any Kubernetes environment. CIOs and CTOs also need to build redundancy into the compute, storage and networking equipment supporting Kubernetes environments in on-premise data centres.
On the orchestration layer, a multi-master Kubernetes cluster is a good starting point. Master nodes should also be distributed across cloud provider zones to ensure they are not affected by outages in any one zone.
Availability on the orchestration layer, however, needs to move beyond simple multi-master clusters. A best practice is to provision a minimum of 3 master nodes distributed across multiple zones. Similarly, worker nodes should also be distributed across zones for high availability.
In addition to having at least 3 master nodes, a best practice is to replicate the etcd master component and place it on dedicated nodes. It is recommended to have at least 5 etcd members for production clusters.
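The reasoning behind odd member counts can be made concrete with a little arithmetic. etcd is built on the Raft consensus algorithm, which requires a majority of members (a quorum) to agree before a write is committed. The sketch below computes the quorum size and the number of member failures a cluster can survive; the function names are illustrative only.

```python
def quorum(members: int) -> int:
    # A majority of members must be available for the cluster to accept writes.
    return members // 2 + 1


def failure_tolerance(members: int) -> int:
    # Members that can be lost while a majority still remains.
    return members - quorum(members)


for n in (1, 3, 4, 5):
    print(f"{n} members: quorum={quorum(n)}, tolerates {failure_tolerance(n)} failure(s)")

# 1 members: quorum=1, tolerates 0 failure(s)
# 3 members: quorum=2, tolerates 1 failure(s)
# 4 members: quorum=3, tolerates 1 failure(s)
# 5 members: quorum=3, tolerates 2 failure(s)
```

A fourth member adds no extra fault tolerance over three, which is why cluster sizes are normally odd. Five members survive two simultaneous failures, for example one member down for maintenance while another fails unexpectedly.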
On the application layer, CIOs and CTOs need to ensure the use of native Kubernetes controllers like statefulsets or deployments. These will ensure that the desired number of pod replicas are always up and running.
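Conceptually, these controllers work by repeatedly comparing the desired state with the observed state and correcting any drift. The sketch below is a heavily simplified illustration of that idea, not Kubernetes code: the `reconcile` function and the pod names are hypothetical, and a real controller watches the API server and handles many more cases.

```python
from typing import List


def reconcile(desired_replicas: int, running_pods: List[str]) -> List[str]:
    """Return the actions needed to move the observed state to the desired state."""
    difference = desired_replicas - len(running_pods)
    if difference > 0:
        # Too few replicas: create the missing ones.
        return ["create"] * difference
    if difference < 0:
        # Too many replicas: remove the surplus pods.
        return [f"delete {name}" for name in running_pods[desired_replicas:]]
    # Observed state already matches the desired state.
    return []


print(reconcile(3, ["web-0"]))                   # two pods have failed
print(reconcile(2, ["web-0", "web-1", "web-2"]))  # scaled down from 3 to 2
print(reconcile(2, ["web-0", "web-1"]))           # nothing to do
```

A bare pod that is not managed by a controller has no such loop watching over it, so nothing replaces it when its node fails.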
Backup and disaster recovery should also figure at the top of every CIO's to-do list for Kubernetes clusters in production. The etcd master component is responsible for storing the cluster state and configuration. Having a plan for regular etcd backups is, therefore, a best practice. Stateful workloads on Kubernetes leverage persistent volumes, which also need to be backed up.
Backup and disaster recovery are important elements of mission-critical enterprise applications. CTOs and CIOs need a well-thought-out and comprehensive high availability, backup and disaster recovery mechanism for Kubernetes that encompasses all layers.
The future of enterprise software is moving towards containerised microservices based distributed applications deployed on Kubernetes with the cloud as an underlying layer. This new cloud-native landscape needs to be reflected in the way dev and ops teams are organised internally as well as in the software release cycle.
Kubernetes and cloud-native technologies have changed traditional dev and ops roles, broken down siloed dev and ops teams and changed the entire software release lifecycle. Given these paradigm changes, we will outline best practices for CIOs and CTOs in terms of role definitions, team composition and new paradigms for developing and deploying software.
DevOps has already broken up the siloed development, testing and operations teams that serviced traditional monolithic applications. More and more developer teams are internalising ops skills.
In the new cloud-native world, however, the boundary between dev and ops has blurred even more. CIOs and CTOs need to ensure that every DevOps team has the required skills and knowledge to automate, monitor and optimize the distributed, cloud-native applications being developed. They should also have the required skills to ensure highly available and scalable applications, implement networking as well as onboard the tools required throughout the application lifecycle.
One way to inject these skills into existing DevOps teams is to move towards SRE. SRE is an implementation of DevOps, developed internally by Google, that pushes for an even more overlapping skill set for individual developers. SREs typically divide their time equally between development and ops responsibilities.
In the context of Kubernetes, a best practice for CIOs and CTOs is to sprinkle SREs among DevOps teams. These SREs would, in turn, be responsible for both development as well as managing performance, on-boarding tools, building in automation and monitoring.
The increasingly distributed nature of enterprise applications translating into distributed DevOps teams, however, does not mean that central IT loses its significance. There does need to be some degree of control and oversight over these teams.
Even though organisations increasingly prefer developers with cross-domain knowledge of ops, overlapping skills do tend to dilute both development and ops.
A best practice, therefore, is to have a central IT team that includes personnel with ops and infrastructure skill sets. This skill set will enable central IT to provide DevOps teams with critical services that are shared by those teams. It will also ensure that organisations avoid wasted effort due to distributed teams figuring out solutions to shared problems.
Both the cloud and Kubernetes itself have made it increasingly easy for teams to provision and consume resources. The cloud-native movement and DevOps also emphasise agility and the ability to self-service resources. This can at times lead to an explosion in the number of compute resources provisioned, with the resulting waste and inefficient resource usage. A strong central IT team will be able to govern these distributed teams and avoid the fallout from self-service and ballooning resources. It will also be able to hold teams accountable.
In the same way that Kubernetes and the wider cloud-native toolset have made CIOs rethink traditional dev and ops roles, they have also required a new way of thinking about build and release cycles. Containerised, microservices-based applications that are developed, deployed and managed by distributed teams are poorly suited to traditional one-dimensional build and release pipelines.
A best practice for CIOs, therefore, is to support distributed teams with a well-tooled and thought-out CI/CD pipeline. A robust CI/CD pipeline is essential to fully realising the benefits of faster release cycles and agility promised by Kubernetes and cloud-native technologies. There are a number of tools that CIOs and CTOs can use to deploy CI/CD pipelines. These include Jenkins, TravisCI, GitLab CI and Spinnaker.
CI/CD is a broad concept and touches on aspects of development, testing and operations. When deploying a CI/CD pipeline from scratch, a good place to start is with the developer team. Continuous integration is the part of CI/CD that aims to increase the frequency of code merges and automate build and test processes.
Instead of developing new features in isolation, developers are encouraged to merge code into the main pipeline as frequently as possible. An automated build is created from these code changes which is then run through a suite of automated tests. Getting developer teams to adopt CI best practices will ensure that code changes and new features are always ready to be pushed out to production.
Once CI practices are firmly in place, CIOs and CTOs can then move on to continuous delivery and deployment. Continuous delivery is an extension of continuous integration where code changes are run through more rigorous tests and ultimately deployed to an environment that closely mirrors the production environment.
With continuous delivery there is often a human element involved making decisions about when and how frequently to push code into production. Continuous deployment automates the entire pipeline by automatically pushing code into production once it passes the automated builds and tests defined in both the integration and delivery phases.
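The difference between the two models comes down to a single release gate, which can be sketched as follows. This is an illustration of the distinction rather than the behaviour of any specific CI/CD tool; the `Mode` enum and `should_release` function are hypothetical names.

```python
from enum import Enum


class Mode(Enum):
    DELIVERY = "continuous delivery"
    DEPLOYMENT = "continuous deployment"


def should_release(mode: Mode, tests_passed: bool, human_approved: bool = False) -> bool:
    """Decide whether a build may be pushed to production."""
    if not tests_passed:
        # A failing build is never released in either model.
        return False
    if mode is Mode.DEPLOYMENT:
        # Continuous deployment: passing the automated stages is sufficient.
        return True
    # Continuous delivery: the build is release-ready, but a person decides when it ships.
    return human_approved


print(should_release(Mode.DELIVERY, tests_passed=True))                       # False
print(should_release(Mode.DELIVERY, tests_passed=True, human_approved=True))  # True
print(should_release(Mode.DEPLOYMENT, tests_passed=True))                     # True
print(should_release(Mode.DEPLOYMENT, tests_passed=False))                    # False
```

In both models the automated tests are the first gate. Continuous deployment simply removes the manual approval that follows them, which is why it depends on a mature test suite.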
Agile distributed teams working in isolation can at times lead to an explosion in the number of isolated build pipelines. To avoid this, a best practice for CIOs is to make the CI/CD pipeline the only way to push code into production. This will ensure that all code changes are pushed into a unified build pipeline and are subjected to a consistent set of integration and test suites.
Distributed teams also tend to use a number of different tools and frameworks. CIOs need to ensure that the CI/CD pipeline is flexible enough to accommodate this usage.
Another best practice is to encourage a culture of small incremental code changes and frequent merges among developer teams. Smaller changes are easier to integrate and roll back and minimise the fallout if something goes wrong.
CIOs also need to institute a build once policy at the start of the pipeline. This ensures that later phases of the CI/CD pipeline have a consistent build to work with. It also avoids any inconsistencies that can creep in when using multiple build tools.
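One common way to apply a build once policy is to identify the build by a content digest and promote that exact artifact through every stage instead of rebuilding per environment. The sketch below illustrates the idea under that assumption; the `Artifact` class, the `build` and `promote` functions and the stage names are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Artifact:
    name: str
    digest: str  # content hash that identifies this exact build


def build(name: str, source: bytes) -> Artifact:
    # Build exactly once, at the start of the pipeline, and record the digest.
    return Artifact(name=name, digest=hashlib.sha256(source).hexdigest())


def promote(artifact: Artifact, stages: List[str]) -> Dict[str, str]:
    # Every later stage receives the same digest; nothing is rebuilt per environment.
    return {stage: artifact.digest for stage in stages}


artifact = build("web", b"print('hello')")
deployed = promote(artifact, ["test", "staging", "production"])

# All stages reference one and the same build.
print(len(set(deployed.values())) == 1)
```

If a stage ever reports a different digest, the binary that reached production is not the one that was tested, which is the inconsistency the policy is meant to rule out.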
Additionally, CIOs need to strike a balance between the extent of the testing regime they push code changes through and the speed of the pipeline itself. More rigorous testing regimes, while minimising the chances of bad code being pushed to production, also carry a time overhead.
CI/CD pipelines, even though they champion decentralisation and agility, still need to be governed by central IT for major feature releases. CIOs and CTOs need to strike a balance between governance and oversight from central IT and the agility and flexibility of distributed teams. They need to ensure a degree of oversight that gives them control without reducing the release velocity of software and teams.
Even though Kubernetes on its own is feature-rich, mission-critical enterprise workloads need to be supported by more feature-rich variants to provide the required service levels.
There are a number of managed Kubernetes offerings from public cloud providers that CIOs and CTOs can evaluate. These managed offerings take over some of the heavy lifting involved in managing upgrades, patches and HA. Public cloud provider offerings do, however, restrict Kubernetes environments to a specific vendor and might not fit well with a future hybrid or multi-cloud strategy.
Commercial value-added Kubernetes distributions are also available from vendors like Red Hat, Docker, Heptio, Pivotal and Rancher. Below we outline some of the features CIOs and CTOs need to look for when choosing one.
High availability and disaster recovery: CIOs and CTOs need to look for distributions that support high availability out of the box. This would include support for multi-master architectures, highly available etcd components as well as backup and recovery.
Hybrid and multi-cloud support: Vendor lock-in is a very real concern for the modern enterprise. To ensure Kubernetes environments are portable, CIOs need to choose distributions that support a wide range of deployment models, from on-premise to hybrid and multi-cloud. Support for creating and managing multiple clusters is another feature that should be evaluated.
Management, upgrades and operational support: Managed Kubernetes offerings also need to be evaluated based on ease of setup, installation and cluster creation as well as day 2 operations including upgrades, monitoring and troubleshooting. A baseline requirement should be support for fully automated cluster upgrades with zero downtime. The solution chosen should also allow upgrades to be triggered manually. Monitoring, health checks, cluster and node metrics, and alerts and notifications should also come as standard.
Identity and access management: Identity and access management is important in terms of both security and governance. CIOs need to ensure that the Kubernetes distribution they choose supports integration with the authentication and authorisation tools already being used internally. RBAC and granular access control are also important feature sets that should be supported.
Networking and storage: The Kubernetes networking model is highly configurable and can be implemented using a number of options. The distribution chosen should either have a native software-defined networking solution that covers the wide range of requirements imposed by different applications and infrastructure, or support one of the more popular CNI-based networking implementations such as Flannel, Calico, kube-router or OVN. CIOs also need to ensure that the Kubernetes distribution they choose supports, at a minimum, either FlexVolume or CSI integration with storage providers, as well as deployment on multiple cloud providers and on-premise.
Deploy, manage and upgrade applications: Kubernetes distributions being considered by CIOs also need to support a comprehensive solution for deploying, managing and upgrading applications. A Helm-based application catalog that aggregates both private and public chart repositories should be a minimum requirement.