
Kubernetes in Production: The Ultimate Guide to Monitoring Resource Metrics with Grafana

This is the second instalment in our blog series about monitoring Kubernetes resource metrics in production. In this post, we complement the process of Kubernetes resource monitoring with Prometheus by installing Grafana and leveraging the Prometheus data source to create information-rich dashboards in a user-friendly visual format.

Hasham Haider


April 3, 2019

13 minute read

This is the second instalment in our guide to monitoring Kubernetes resource metrics in production. In the first instalment, we identified the Kubernetes resource metrics that need to be monitored. We did this by combining the hardware and OS metrics exposed by the underlying Linux kernel with the unique software and hardware abstractions Kubernetes introduces and its resource management model.

We also delved into why monitoring these metrics is important, went through the process of installing Metrics-Server and Prometheus and outlined Prometheus expressions for monitoring some of these metrics.

In this section, we will move further down the monitoring pipeline and will introduce Grafana. Grafana is an open source platform for analytics and metric visualization. We will go through the process of setting up Grafana for our Kubernetes cluster and will then create a dashboard incorporating the resource metrics we have identified.

Alright so now that we have defined the context for this blog post, let’s start by setting up Grafana.

Setting up Grafana

Grafana is bundled with the Prometheus operator helm chart. We have already covered installation of the Prometheus operator in the first instalment of this blog series. Below is a quick recap:

Download the latest Helm version from here. I downloaded the helm-v2.11.0-linux-amd64.tar.gz version.

Unpack it:

tar -zxvf helm-v2.11.0-linux-amd64.tar.gz

Move it to your bin directory:

mv linux-amd64/helm /usr/local/bin/helm

Initialize helm and install tiller:

helm init

Create a service account:

kubectl create serviceaccount --namespace kube-system tiller

Bind the new service account to the cluster-admin role and give tiller admin access to the entire cluster:

kubectl create clusterrolebinding tiller-cluster-rule --clusterrole=cluster-admin --serviceaccount=kube-system:tiller

Deploy tiller and add the line serviceAccount: tiller to spec.template.spec:

kubectl patch deploy --namespace kube-system tiller-deploy -p '{"spec":{"template":{"spec":{"serviceAccount":"tiller"}}}}'

Install the Prometheus operator:

helm install --name prom-operator stable/prometheus-operator --namespace monitoring

This will install the Prometheus operator in the namespace monitoring. You can see the Grafana instance running in this namespace using:

kubectl --namespace monitoring get pods

Now forward the Grafana instance to a port on your local machine:

kubectl port-forward -n monitoring prometheus-operator-grafana-7859656fc4-m77cz 3000:3000

Make sure you use the correct Grafana pod name. Access the dashboard by navigating to http://localhost:3000. You will be asked to provide a username and password to log in. Use the following credentials:
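Since the Grafana pod name carries a random suffix that changes across deployments, you can look it up with a label selector instead of copying it by hand. A minimal sketch, assuming the chart labels the Grafana pod with app=grafana (the exact label may differ between chart versions, so verify it first):

```shell
# Look up the Grafana pod by label and forward its port.
# Assumes the chart applies the label app=grafana; check with:
#   kubectl -n monitoring get pods --show-labels
GRAFANA_POD=$(kubectl -n monitoring get pods -l app=grafana \
  -o jsonpath='{.items[0].metadata.name}')
kubectl -n monitoring port-forward "$GRAFANA_POD" 3000:3000
```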

username: admin and password: prom-operator

This completes the Grafana setup. We are now ready to start creating a Kubernetes dashboard based on the metrics we have identified.

Creating Grafana Dashboard for Kubernetes Resource Metrics

To create a dashboard click on the Home button in the top right corner of the Grafana home screen, select New Dashboard from the drop-down list and click on the Graph icon. This will create a new dashboard with a placeholder panel. 

Grafana-kubernetes-resources-dashboard-1

Create the Prometheus Data Source Variable

Before we update the panel, however, we first have to create the Prometheus data source variable. We can do this by clicking on the cog icon (settings) in the top right corner of the Grafana dashboard and navigating to variables in the settings panel to the left. Click on Add variable and fill in the fields for name and type as shown in the screenshot below:

Prometheus-data-source-variable-grafana-1

Grafana Kubernetes Dashboard Layout

Now that we have created the Prometheus data source variable we are ready to create the dashboard layout. We are basing the Dashboard layout on Kubernetes abstractions including pods, nodes, namespaces and clusters. The dashboard will have separate sections for each abstraction with individual usage, request and utilization metrics.

Let's create the basic dashboard layout first by adding separate sections for Pods, Nodes, Namespaces and Clusters.


To create a new section click on Add Panel in the top right corner of the dashboard and then click on Row. Edit the row title by clicking on the cog icon next to it and rename it Pod. Do the same for Node, Namespace and Cluster.

Grafana-kubernetes-resources-dashboard-18

To make the dashboard setup modular, let’s also create a couple of template panels that we can re-use later on. We will be tracking metrics for two types of resources: CPU and Memory. The panels for these resources share a number of settings, so we will create one template panel for each resource. Follow the instructions below to create the CPU and Memory Template panels:

CPU Template Panel

Click on the Panel Title and Edit. Fill out the Axes, Legend and Display tabs as shown in the screenshots below:

Axes Tab

CPU-grafana-template-panel-1

Legend Tab

CPU-grafana-template-panel-3

Display Tab

CPU-grafana-template-panel-4

Rename the panel to CPU Template Panel under the General Tab.

Memory Template Panel

For the Memory Template Panel create a new graph and fill out the Axes, Legend and Display tabs as shown in the screenshots below:

Axes Tab

CPU-grafana-template-panel-4-16-1

Legend Tab

CPU-grafana-template-panel-3

Display Tab

CPU-grafana-template-panel-4

Rename the panel to Memory Template Panel under the General Tab.

Once finished you will be able to see both templates on the main dashboard view:

Grafana-kubernetes-resources-dashboard-19

Now that we have created the template panels, we are ready to create individual panels for each of the sections. Let's start with the Pod section.

Adding Pod Level Resource Metrics to the Grafana Kubernetes Dashboard

We want to have four panels in this section; one each for Pod level CPU Usage, CPU Requests, Memory Usage and Memory Requests.

Duplicate the CPU and Memory Template panels twice each by clicking on the panel title, then More and Duplicate, and add the duplicates to the Pod section.

Now we are ready to update the template panels with the individual expressions for each resource metric. Let’s start with Pod CPU Usage.

Click on the panel title and Edit for one of the duplicated CPU Template panels. Under the Metrics tab, paste the following into the field marked A:

sum(rate(container_cpu_usage_seconds_total{container_name!="POD",pod_name!=""}[5m])) by (pod_name)

Type "{{pod_name}}" into the Legend format field and change the dashboard panel name to CPU Usage under the General tab.

Grafana-kubernetes-resources-dashboard-pod-cpu-usage-36

Click on Save Dashboard in the top right corner and browse back to the main dashboard screen. We can now see the template for Pod CPU Usage on the main dashboard screen:

Grafana-kubernetes-resources-dashboard-23

You can also sort the Current column of the CPU Usage table on the panel in ascending or descending order by clicking on its title. In descending order, the pods using the most CPU at the current time are displayed first.
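The rate() function in the Pod CPU Usage expression turns the cumulative container_cpu_usage_seconds_total counter into the average number of cores consumed over the 5 minute window. The underlying arithmetic is easy to sanity-check with a pair of hypothetical counter samples:

```shell
# Hypothetical samples of a CPU-seconds counter taken 300s apart:
# 120.0s of accumulated CPU time at t1, 165.0s at t2.
# rate() ~ (second sample - first sample) / window
awk 'BEGIN { t1 = 120.0; t2 = 165.0; window = 300; printf "%.2f\n", (t2 - t1) / window }'
# prints 0.15 -- the pod averaged 0.15 cores over the window
```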

For Pod CPU Requests edit the second duplicated CPU Template panel and enter the following expression into Field A under the Metrics tab:

sum(kube_pod_container_resource_requests_cpu_cores) by (pod)

Enter "{{pod}}" in the Legend format field and update the panel name to CPU Requests.

This will update the duplicated panel on the main dashboard screen to show Pod CPU requests:

Grafana-kubernetes-resources-dashboard-22

For Pod Memory Usage and Requests, edit the duplicated Memory Template panels, update the panel names under the General tab and enter the following expressions into Field A under the Metrics tab:

For Pod Memory Usage:

sum(container_memory_usage_bytes{container_name!="POD",container_name!=""}) by (pod_name)

For Pod Memory requests:

sum(kube_pod_container_resource_requests_memory_bytes) by (pod)

Also, enter "{{pod_name}}" in the Legend format field for the Memory Usage panel and "{{pod}}" for the Memory Requests panel, matching the labels the two expressions group by.

You should now be able to see all four dashboard panels under the Pod row:

Grafana-kubernetes-resources-dashboard-24

Next, we will add panels for Node level resource metrics to the Kubernetes dashboard.

Adding Node Level Resource Metrics to the Grafana Kubernetes Dashboard

As with Pods, we will add four panels to the Node section of our Kubernetes dashboard. The first two panel metrics are similar to the ones for pods; Node CPU and Memory Usage. However, in place of CPU Requests and Memory Requests, we will be adding CPU Utilization and Memory Utilization panels to the Nodes section.

Let’s start off by duplicating both the CPU and Memory Template Panels. We will first set up the Node CPU Usage panel. This metric aggregates the CPU usage of all pods running on each node.

Update one of the duplicated CPU Template panels by changing the name to CPU Usage and entering the following metric in Field A under the Metrics tab:

sum(rate(container_cpu_usage_seconds_total{container_name!="POD",pod_name!=""}[5m])) by (node)

Additionally, enter "{{node}}" into the Legend Format field. This will update the duplicated panel to show Node CPU Usage on the main Dashboard.

Grafana-kubernetes-resources-dashboard-node-cpu-usage

For Memory Usage edit the duplicated Memory Template panel and enter the following expression into Field A under the Metrics tab:

sum(container_memory_usage_bytes{container_name!="POD",container_name!=""}) by (node)

Enter "{{node}}" in the Legend format field and update the panel name to Memory Usage.

Save the dashboard and browse back to the main dashboard screen:

Grafana-kubernetes-resources-dashboard-node-memory-usage-28

Next, we will create the Node CPU Utilization and Memory Utilization panels by updating Field A under the Metrics tab with the following metrics:

For Node CPU Utilization:

node:node_cpu_utilisation:avg1m

For Node Memory Utilization:

node:node_memory_utilisation:

Next, enter "{{node}}" into the Legend format field. Change the Unit under the Axes tab from none to percent (0.0-1.0) and update the panel names. This will create both the CPU and Memory Utilization panels on the main dashboard screen under the Node section:
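The node:node_memory_utilisation: recording rule expresses memory utilization as a 0-1 fraction, roughly 1 minus the ratio of available to total memory on the node. With hypothetical numbers:

```shell
# Hypothetical node: 16 GiB total memory, 6 GiB still available.
# utilization = 1 - available / total
awk 'BEGIN { total = 16; avail = 6; printf "%.3f\n", 1 - avail / total }'
# prints 0.625 -- shown as 62.5% with the percent (0.0-1.0) unit
```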

Grafana-kubernetes-resources-dashboard-node-cpu-memory-utilization-29

Adding Namespace Level Resource Metrics to the Grafana Kubernetes Dashboard

As with Pods, we will add four panels at the namespace level to our Kubernetes dashboard: one each for CPU Usage, CPU Requests, Memory Usage and Memory Requests.

Duplicate the CPU Template panels and enter the following expressions into Field A:

For Namespace CPU Usage:

sum(rate(container_cpu_usage_seconds_total{container_name!="POD", namespace!=""}[5m])) by (namespace)

For Namespace CPU Requests:

sum(kube_pod_container_resource_requests_cpu_cores) by (namespace)

Next, type "{{namespace}}" in the Legend format field.

Browse back to the main Grafana dashboard where you can now see both the Namespace CPU Usage and Request panels:

Grafana-kubernetes-resources-dashboard-namespace-cpu-usage-requests-30

Next, we will create the Namespace Memory Usage and Memory Requests dashboard panels. Update Field A for the duplicated Memory Template panels with the following expressions:

For Namespace Memory Usage:

sum(container_memory_usage_bytes{container_name!="POD",container_name!=""}) by (namespace)

For Namespace Memory Requests:

sum(kube_pod_container_resource_requests_memory_bytes) by (namespace)

Type "{{namespace}}" into the Legend format field.

Now we can see all four Namespace panels on the main dashboard screen:

Grafana-kubernetes-resources-dashboard-namespace-memory-usage-requests-31

Adding Cluster Level Resource Metrics to the Grafana Kubernetes Dashboard

Let’s now add resource metrics on the cluster level to our Grafana Kubernetes dashboard.

We will add four panels to this section of our dashboard; Cluster CPU utilization, memory utilization, and CPU and memory request commitments. CPU and memory request commitment metrics compare the sum of CPU and memory requests for all pods to the total capacity of the cluster.

Let’s start off with Cluster CPU and Memory Utilization. These resource metrics will be represented as percentages and will show us the extent to which the pods running on our nodes are utilizing the CPU or memory resources available on the node.

Duplicate the CPU and Memory Template panels, rename them and enter the following expressions into Field A:

For Cluster CPU Utilization:

:node_cpu_utilisation:avg1m

For Cluster Memory Utilization:

:node_memory_utilisation:

Under the Axes tab, choose percent (0.0-1.0) as the Unit and enter Cluster into the Legend format field.

Now save the dashboard and browse back to the main Grafana screen:

Grafana-kubernetes-resources-dashboard-cluster-cpu-memory-utilization-32

Now let’s add CPU and Memory Request commitments. We can do this by entering the following expressions into Field A of the duplicated CPU and Memory Template panels: 

For CPU Request Commitment

sum(kube_pod_container_resource_requests_cpu_cores) / sum(node:node_num_cpu:sum)

For Memory Request Commitment

sum(kube_pod_container_resource_requests_memory_bytes) / :node_memory_MemTotal:sum

Choose percent (0.0-1.0) as the Unit under the Axes tab and enter Cluster into the Legend format field.
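Since a request commitment is a plain ratio of requested resources to cluster capacity, the numbers are easy to sanity-check by hand. A hypothetical example:

```shell
# Hypothetical cluster: pods request 6.4 cores in total, nodes provide 16 cores.
# CPU request commitment = total requests / total capacity
awk 'BEGIN { requested = 6.4; capacity = 16; printf "%.2f\n", requested / capacity }'
# prints 0.40 -- shown as 40% with the percent (0.0-1.0) unit
```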

Saving the dashboard and browsing back to the main view will show us all four Cluster level panels:

Grafana-kubernetes-resources-dashboard-cluster-cpu-memory-request-commitment-33

Conclusion

This post was a continuation of our “Monitoring Kubernetes resources in Production” blog series. In the first instalment, we defined Kubernetes resources and identified resource metrics based on hardware and OS metrics, Kubernetes primitives and its resource management model. We also looked into the process of setting up Prometheus and outlined Prometheus expressions for those metrics.

In this post, we complemented the process of Kubernetes resource monitoring with Prometheus by installing Grafana. Grafana allows us to leverage the Prometheus data source and create information-rich graphs in a user-friendly visual format. We also created a Kubernetes dashboard based on the resource metrics we identified.

A resource monitoring architecture built on Prometheus and Grafana gives us insight into resource consumption and utilization for native Kubernetes objects like pods, containers and namespaces. It does, however, fall short when it comes to custom groupings.

Custom groupings are usually driven by organizational imperatives and can include teams, applications, clients and departments etc. Understanding the consumption and utilization profiles of these custom groupings is important for ensuring efficient resource usage and future forecasting. Additionally, Prometheus and Grafana do not provide any cost related metrics; either for native Kubernetes objects or for custom groupings.

Replex’s dedicated Kubernetes solution provides consumption and utilization metrics across both native Kubernetes objects as well as custom groupings out of the box. It complements this with granular cost visibility and transparency, by correlating resource consumption with the individual cost profiles of all infrastructure and abstraction layers.

Replex’s optimize module also provides DevOps teams and IT managers actionable intelligence on optimizing their infrastructure footprint. It does this by ingesting and analyzing historical and real-time consumption and utilization metrics. DevOps teams can then right-size their infrastructure, leading to cost savings of up to 30%. The optimize module can also be run in automated mode, where it automatically right-sizes infrastructure based on these signals.
