[Kubernetes, Kubernetes in Production]
This is the second instalment in our blog series about monitoring Kubernetes resource metrics in production. In this post, we complement the process of Kubernetes resource monitoring with Prometheus by installing Grafana and leveraging the Prometheus data source to create information-rich dashboards in a user-friendly visual format.
Hasham Haider
April 3, 2019
13 minute read
This is the second instalment in our guide to monitoring Kubernetes resource metrics in production. In the first instalment, we identified the Kubernetes resource metrics that need to be monitored. We did this by combining the hardware and OS metrics exposed by the underlying Linux kernel with the unique software and hardware abstractions Kubernetes introduces and its resource management model.
We also delved into why monitoring these metrics is important, went through the process of installing Metrics-Server and Prometheus and outlined Prometheus expressions for monitoring some of these metrics.
In this section, we will move further down the monitoring pipeline and will introduce Grafana. Grafana is an open source platform for analytics and metric visualization. We will go through the process of setting up Grafana for our Kubernetes cluster and will then create a dashboard incorporating the resource metrics we have identified.
Alright, so now that we have defined the context for this blog post, let's start by setting up Grafana.
Grafana ships as part of the Prometheus operator Helm chart. We have already covered installation of the Prometheus operator in the first instalment of this blog series. Below is a quick recap:
Download the latest Helm release from the Helm releases page. I downloaded the helm-v2.11.0-linux-amd64.tar.gz version.
Unpack it:
tar -zxvf helm-v2.11.0-linux-amd64.tar.gz
Move it to your bin directory:
mv linux-amd64/helm /usr/local/bin/helm
Initialize helm and install tiller:
helm init
Create a service account:
kubectl create serviceaccount --namespace kube-system tiller
Bind the new service account to the cluster-admin role and give tiller admin access to the entire cluster:
kubectl create clusterrolebinding tiller-cluster-rule --clusterrole=cluster-admin --serviceaccount=kube-system:tiller
Patch the tiller deployment to use the new service account by adding serviceAccount: tiller to spec.template.spec:
kubectl patch deploy --namespace kube-system tiller-deploy -p '{"spec":{"template":{"spec":{"serviceAccount":"tiller"}}}}'
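Before moving on, it is worth verifying that Helm can reach the newly patched Tiller pod. A quick sanity check (an extra step, not part of the original walkthrough) is:
helm version
Once Tiller is up and running under the new service account, both the client and server versions should be printed.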
Install the Prometheus operator:
helm install --name prom-operator stable/prometheus-operator --namespace monitoring
This will install the Prometheus operator in the namespace monitoring. You can see the Grafana instance running in this namespace using:
kubectl --namespace monitoring get pods
Now forward the Grafana instance to a port on your local machine:
kubectl port-forward -n monitoring prometheus-operator-grafana-7859656fc4-m77cz 3000:3000
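The random suffix in the pod name will differ in your cluster, so look it up first. One simple way to do that (an extra step, not from the original post) is:
kubectl -n monitoring get pods | grep grafana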
Make sure you use the correct Grafana pod name. Access the dashboard by navigating to http://localhost:3000. You will be asked to provide a username and password to login. Use the following credentials:
username: admin and password: prom-operator
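If these defaults were overridden at install time, the credentials can be read back from the secret the chart creates. The secret name below mirrors the Grafana pod name above and is an assumption; adjust it to match your release:
kubectl -n monitoring get secret prometheus-operator-grafana -o jsonpath="{.data.admin-password}" | base64 --decode; echo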
This completes the Grafana setup. We are now ready to start creating a Kubernetes dashboard based on the metrics we have identified.
To create a dashboard click on the Home button in the top right corner of the Grafana home screen, select New Dashboard from the drop-down list and click on the Graph icon. This will create a new dashboard with a placeholder panel.
Before we update the panel, however, we first have to create the Prometheus data source variable. We can do this by clicking on the cog icon (settings) in the top right corner of the Grafana dashboard and navigating to variables in the settings panel to the left. Click on Add variable and fill in the fields for name and type as shown in the screenshot below:
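Since the screenshot is not reproduced here, a typical configuration for this variable (an assumption based on common Grafana conventions, not taken from the screenshot) would be:
Name: datasource
Type: Datasource
Data source type: Prometheus
Panels can then reference the variable as $datasource when selecting their data source.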
Now that we have created the Prometheus data source variable we are ready to create the dashboard layout. We are basing the Dashboard layout on Kubernetes abstractions including pods, nodes, namespaces and clusters. The dashboard will have separate sections for each abstraction with individual usage, request and utilization metrics.
Let's create the basic dashboard layout first by adding separate sections for Pods, Nodes, Namespaces and Clusters.
To create a new section click on Add Panel in the top right corner of dashboard and then click on Row. Edit the Row title by clicking on the cog icon next to it and rename it Pod. Do the same for Node, Namespace and Cluster.
To make the dashboard setup modular, let's also create a couple of template panels that we can re-use later on. We will be tracking metrics for two types of resources: CPU and Memory. The panels for each resource share a number of settings, so we will create one template panel per resource. Follow the instructions below to create the CPU and Memory Template panels:
Click on the Panel Title and Edit. Fill out the Axes, Legend and Display tabs as shown in the screenshot below:
Rename the panel to CPU Template Panel under the General Tab.
For the Memory Template Panel create a new graph and fill out the Axes, Legend and Display tabs as shown in the screenshot below:
Rename the panel to Memory Template Panel under the General Tab.
Once finished you will be able to see both templates on the main dashboard view:
Now that we have created the template panels, we are ready to create individual panels for each of the sections. Let's start with the Pod section.
We want to have four panels in this section; one each for Pod level CPU Usage, CPU Requests, Memory Usage and Memory Requests.
Duplicate both the CPU and Memory Template panels by clicking on the panel title, then More and Duplicate, and add the copies to the Pod section.
Now we are ready to update the template panels with the individual expressions for each resource metric. Let’s start with Pod CPU Usage.
Click on the panel title and Edit for one of the duplicated CPU Template panels. Under the Metrics tab, paste the following into the field marked A:
sum(rate(container_cpu_usage_seconds_total{container_name!="POD",pod_name!=""}[5m])) by (pod_name)
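Note that the pod_name and container_name labels are exposed by older Kubernetes/cAdvisor versions. If your cluster exposes the newer pod and container labels instead, an equivalent expression (an adjusted variant, not from the original post) would be:
sum(rate(container_cpu_usage_seconds_total{container!="POD",pod!=""}[5m])) by (pod)
In that case use "{{pod}}" as the Legend format, and apply the same label substitution to the other container_* expressions in this dashboard.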
Type "{{pod_name}}" into the Legend format field and change the dashboard panel name to CPU Usage under the General tab.
Click on Save Dashboard in the top right corner and browse back to the main dashboard screen. We can now see the template for Pod CPU Usage on the main dashboard screen:
You can also sort the table of current CPU usage below the panel in ascending or descending order by clicking on the Current column header. In descending order, the pods using the most CPU at the moment are listed first.
For Pod CPU Requests edit the second duplicated CPU Template panel and enter the following expression into Field A under the Metrics tab:
sum(kube_pod_container_resource_requests_cpu_cores) by (pod)
Enter "{{pod}}" in the Legend format field and update the panel name to CPU Requests
This will update the duplicated panel on the main dashboard screen to show Pod CPU requests:
For Pod Memory Usage and Requests, edit the duplicated Memory Template panels, update the panel names under the General tab and enter the following expressions into Field A under the Metrics tab:
For Pod Memory Usage:
sum(container_memory_usage_bytes{container_name!="POD",container_name!=""}) by (pod_name)
For Pod Memory requests:
sum(kube_pod_container_resource_requests_memory_bytes) by (pod)
Also, enter "{{pod_name}}" in the Legend format field of the Memory Usage panel and "{{pod}}" in that of the Memory Requests panel, matching the labels the two expressions group by.
You should now be able to see all four dashboard panels under the Pod row:
Next, we will add panels for Node level resource metrics to the Kubernetes dashboard.
As with Pods, we will add four panels to the Node section of our Kubernetes dashboard. The first two are similar to the ones for pods: Node CPU and Memory Usage. However, in place of CPU Requests and Memory Requests, we will be adding CPU Utilization and Memory Utilization panels to the Node section.
Let's start off by duplicating both the CPU and Memory Template panels. We will first set up the Node CPU Usage panel. This metric aggregates the resource usage of all pods running on a specific node.
Update one of the duplicated CPU Template panels by changing the name to CPU Usage and entering the following metric in Field A under the Metrics tab:
sum(rate(container_cpu_usage_seconds_total{container_name!="POD",pod_name!=""}[5m])) by (node)
Additionally, enter "{{node}}" into the Legend Format field. This will update the duplicated panel to show Node CPU Usage on the main Dashboard.
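Whether the cAdvisor series carry a node label depends on how they are scraped; in some setups the node only shows up as the instance label. If the panel comes up empty, a variant grouped by instance (an assumption about your scrape configuration, not from the original post) is worth trying:
sum(rate(container_cpu_usage_seconds_total{container_name!="POD",pod_name!=""}[5m])) by (instance)
The same substitution, with "{{instance}}" as the Legend format, applies to the node-level memory expression below.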
For Memory Usage edit the duplicated Memory Template panel and enter the following expression into Field A under the Metrics tab:
sum(container_memory_usage_bytes{container_name!="POD",container_name!=""}) by (node)
Enter "{{node}}" in the Legend format field and update the panel name to Memory Usage.
Save the dashboard and browse back to the main dashboard screen:
Next, we will create the Node CPU Utilization and Memory Utilization panels by updating Field A under the Metrics tab with the following metrics:
For Node CPU Utilization:
node:node_cpu_utilisation:avg1m
For Node Memory Utilization:
node:node_memory_utilisation:
Next, enter "{{node}}" into the Legend format field. Change the Unit under the Axes tab to none >> percent (0.0-1.0) and update the panel names. This will create both the CPU and Memory Utilization panels on the main dashboard screen under the Node section:
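A quick note on the two utilization expressions above: they are recording rules shipped with the kube-prometheus rule set that the Prometheus operator installs. If they are missing from your Prometheus instance, roughly equivalent raw expressions built on node_exporter metrics (a sketch, assuming node_exporter 0.16+ metric names) are:
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[1m])) by (instance)
1 - sum(node_memory_MemAvailable_bytes) by (instance) / sum(node_memory_MemTotal_bytes) by (instance)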
As with Pods, we will add four panels on the namespace level to our Kubernetes dashboard: one each for CPU Usage, CPU Requests, Memory Usage and Memory Requests.
Duplicate the CPU Template panels and enter the following expressions into Field A:
For Namespace CPU Usage:
sum(rate(container_cpu_usage_seconds_total{container_name!="POD", namespace!=""}[5m])) by (namespace)
For Namespace CPU Requests:
sum(kube_pod_container_resource_requests_cpu_cores{container_name!="POD", namespace!=""}) by (namespace)
Next, type "{{namespace}}" in the Legend format field.
Browse back to the main Grafana dashboard where you can now see both the Namespace CPU Usage and Request panels:
Next, we will create the Namespace Memory Usage and Memory Requests dashboard panels. Update Field A for the duplicated Memory Template panels with the following expressions:
For Namespace Memory Usage:
sum(container_memory_usage_bytes{container_name!="POD",container_name!=""}) by (namespace)
For Namespace Memory Requests:
sum(kube_pod_container_resource_requests_memory_bytes) by (namespace)
Type "{{namespace}}" into the Legend format field.
Now we can see all four Namespace panels on the main dashboard screen:
Let’s now add resource metrics on the cluster level to our Grafana Kubernetes dashboard.
We will add four panels to this section of our dashboard; Cluster CPU utilization, memory utilization, and CPU and memory request commitments. CPU and memory request commitment metrics compare the sum of CPU and memory requests for all pods to the total capacity of the cluster.
Let’s start off with Cluster CPU and Memory Utilization. These resource metrics will be represented as percentages and will show us the extent to which the pods running on our nodes are utilizing the CPU or memory resources available on the node.
Duplicate the CPU and Memory Template panels, rename them and enter the following expressions into Field A:
For Cluster CPU Utilization:
:node_cpu_utilisation:avg1m
For Cluster Memory Utilization:
:node_memory_utilisation:
Under the Axes tab, choose percent (0.0-1.0) as the Unit and enter Cluster into the Legend format field.
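As with the node-level panels, these two expressions are recording rules from the kube-prometheus rule set, this time aggregated across all nodes. If they are not available, cluster-wide fallbacks based on node_exporter metrics (a sketch, assuming node_exporter 0.16+ metric names) would be:
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[1m]))
1 - sum(node_memory_MemAvailable_bytes) / sum(node_memory_MemTotal_bytes)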
Now save the dashboard and browse back to the main Grafana screen:
Now let’s add CPU and Memory Request commitments. We can do this by entering the following expressions into Field A of the duplicated CPU and Memory Template panels:
For CPU Request Commitment:
sum(kube_pod_container_resource_requests_cpu_cores) / sum(node:node_num_cpu:sum)
For Memory Request Commitment:
sum(kube_pod_container_resource_requests_memory_bytes) / :node_memory_MemTotal:sum
Choose percent (0.0-1.0) as the Unit under the Axes tab and enter Cluster into the Legend format field.
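If the node:node_num_cpu:sum and :node_memory_MemTotal:sum recording rules are not present in your setup, roughly equivalent expressions can be built from kube-state-metrics node metrics. The allocatable metrics below existed in pre-2.0 kube-state-metrics releases and are an assumption about your version:
sum(kube_pod_container_resource_requests_cpu_cores) / sum(kube_node_status_allocatable_cpu_cores)
sum(kube_pod_container_resource_requests_memory_bytes) / sum(kube_node_status_allocatable_memory_bytes)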
Saving the dashboard and browsing back to the main view will show us all four Cluster level panels:
This post was a continuation of our “Monitoring Kubernetes resources in Production” blog series. In the first instalment, we defined Kubernetes resources and identified resource metrics based on hardware and OS metrics, Kubernetes primitives and its resource management model. We also looked into the process of setting up Prometheus and outlined Prometheus expressions for those metrics.
In this post, we complemented the process of Kubernetes resource monitoring with Prometheus by installing Grafana. Grafana allows us to leverage the Prometheus data source and create information-rich graphs in a user-friendly visual format. We also created a Kubernetes dashboard based on the resource metrics we identified.
A resource monitoring setup built on Prometheus and Grafana gives us insight into resource consumption and utilization for native Kubernetes objects like pods, containers and namespaces. It does, however, fall short when it comes to custom groupings.
Custom groupings are usually driven by organizational imperatives and can include teams, applications, clients and departments. Understanding the consumption and utilization profiles of these custom groupings is important for ensuring efficient resource usage and for future forecasting. Additionally, Prometheus and Grafana do not provide any cost-related metrics, either for native Kubernetes objects or for custom groupings.
Replex’s dedicated Kubernetes solution provides consumption and utilization metrics for both native Kubernetes objects and custom groupings out of the box. It complements this with granular cost visibility and transparency by correlating resource consumption with the individual cost profiles of all infrastructure and abstraction layers.
Replex’s optimize module also provides DevOps teams and IT managers with actionable intelligence for optimizing their infrastructure footprint. It does this by ingesting and analyzing historical and real-time consumption and utilization metrics. DevOps teams can then right-size their infrastructure, leading to cost savings of up to 30%. The optimize module can also be run in automated mode, where it automatically right-sizes infrastructure based on these signals.