04.22.21

Optimizing Prometheus and Grafana with the Prometheus Operator

By Kevin Taylor

A proactive, efficient approach to Kubernetes cluster monitoring helps engineering teams identify and predict critical problems, such as CPU or memory exhaustion and storage issues, well before they take a toll on the business. Organizations of every size depend on it; CERN, for example, monitors petabytes of Kubernetes cluster data to understand all of its cluster workloads. Solving critical problems before they can have a significant impact saves money, time, and reputation. Monitoring a cluster properly is a challenge, though, and a pain point for many companies, because it starts with knowing exactly what to monitor in the cluster.

This article covers cluster monitoring fundamentals and shows how to use the Prometheus Operator to deploy Prometheus and Grafana to monitor a Kubernetes cluster.

What Is Cluster Monitoring?

Cluster monitoring is the process of monitoring all the components and resources running on a cluster. It involves actively checking the health of all your services and applications and setting up monitoring systems that alert administrators as soon as problems occur. We can monitor CPU utilization, memory utilization, the number of namespaces, pods, deployments, and services running on the cluster, and many more resources.
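To make "CPU utilization" concrete: Kubernetes exposes container CPU usage as a cumulative counter of CPU seconds (for example, the cAdvisor metric container_cpu_usage_seconds_total), and utilization is the counter's increase divided by the elapsed time and the available cores. The Python sketch below does that arithmetic for two samples and treats a decrease as a counter reset; PromQL's rate() function handles resets in the same spirit, over a whole window of samples. The sample values and the 4-core capacity are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    timestamp: float  # Unix time in seconds
    value: float      # cumulative CPU seconds used so far (a counter)


def cpu_utilization(prev: Sample, curr: Sample, cores: float) -> float:
    """Percent of `cores` CPUs used between two samples of a CPU-seconds counter."""
    elapsed = curr.timestamp - prev.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    if curr.value >= prev.value:
        increase = curr.value - prev.value
    else:
        # Counters never decrease; a drop means the container restarted
        # and the counter began again from zero.
        increase = curr.value
    return 100.0 * (increase / elapsed) / cores


# Hypothetical samples taken 60 s apart for a workload on a 4-core node.
before = Sample(timestamp=1000.0, value=500.0)
after = Sample(timestamp=1060.0, value=620.0)
print(f"{cpu_utilization(before, after, cores=4):.1f}%")  # 120 CPU-s / 60 s / 4 cores -> 50.0%
```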

Tools for Cluster Monitoring – Prometheus & Grafana

Prometheus and Grafana are two very popular tool choices for cluster monitoring.

Prometheus is an open-source monitoring system that collects cluster data by sending HTTP requests to (scraping) the metrics endpoints of the various resources running on the cluster. It stores the data in a time-series database for analysis and alerting.
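A metrics endpoint answers each scrape with plain text in the Prometheus exposition format: one sample per line, optionally with labels in braces, plus # HELP and # TYPE comment lines. The minimal Python parser below shows what a single scrape yields. It is a simplified sketch (it skips comments, ignores optional timestamps, and does not handle escaped quotes or a closing brace inside label values), not a substitute for the official Prometheus client libraries; the sample metrics are illustrative.

```python
import re

# metric_name{label="value",...} value [timestamp]
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text: str) -> list[tuple[str, dict[str, str], float]]:
    """Parse exposition-format text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # a line this simplified parser does not understand
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


scrape = """\
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="get",code="200"} 1027
http_requests_total{method="post",code="500"} 3
process_open_fds 12
"""
for name, labels, value in parse_metrics(scrape):
    print(name, labels, value)
```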

Prometheus can graph the metrics it collects in its built-in web UI. However, these basic graphs are not always easy to navigate and understand. Running Grafana alongside Prometheus combines the best features of both tools: Grafana integrates with Prometheus seamlessly and turns the cluster data into clear, attractive dashboards.

Business Advantages of Cluster Monitoring

Cluster monitoring is crucial for any organization whose applications run on clusters, because any problem with the cluster can cause significant losses. For example, Moonlight suffered a 100% traffic outage due to issues with its Kubernetes cluster.

Cluster monitoring:

  • Saves the organization time and money by identifying critical issues in the cluster early.
  • Helps analyze cluster performance and track critical information proactively.
  • Helps identify and avoid impending downtime caused by excessive or poorly managed resource consumption.
  • Alerts the responsible people in real time about problems in the cluster.
  • Can predict, and help prevent, major issues that could bring down the cluster.
  • Maintains a proactive health check on all deployments and services.

Use Cases of Cluster Monitoring

  • We can curate and visualize cluster data by selecting the metrics we want to monitor, giving us a better understanding of the cluster.
  • Cluster monitoring dashboards are easy to share, so teams can pass cluster insights on to each other.
  • We can run ad hoc queries in the monitoring tool to explore the cluster data across different time ranges, data sources, and queries.
  • Exploring logs is a fundamental use case for cluster monitoring that every administrator should perform daily. Log metrics can also reveal details that might not be visible in dashboards.
  • We can write our own alert conditions for critical cases and send notifications via email, chat tools like Slack, webhooks, and more.
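For the webhook option, Alertmanager sends an HTTP POST with a JSON body to your endpoint; each entry in its alerts array carries a status ("firing" or "resolved"), labels, and annotations. The Python sketch below summarizes the firing alerts in such a payload. The sample payload is trimmed to the fields the function reads, and the alert names and annotation text are illustrative.

```python
import json


def firing_summary(payload: dict) -> list[str]:
    """Return one line per firing alert in an Alertmanager webhook payload."""
    lines = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # skip alerts that have already resolved
        labels = alert.get("labels", {})
        summary = alert.get("annotations", {}).get("summary", "(no summary)")
        lines.append(
            f"[{labels.get('severity', 'unknown')}] "
            f"{labels.get('alertname', '?')}: {summary}"
        )
    return lines


# Trimmed example payload; real payloads include more fields
# (receiver, groupLabels, startsAt, endsAt, and others).
body = json.loads("""
{
  "status": "firing",
  "alerts": [
    {"status": "firing",
     "labels": {"alertname": "NodeHighCPU", "severity": "warning"},
     "annotations": {"summary": "CPU above 90% on node-1"}},
    {"status": "resolved",
     "labels": {"alertname": "PodCrashLooping", "severity": "critical"},
     "annotations": {}}
  ]
}
""")
for line in firing_summary(body):
    print(line)
```

In a real receiver, this function would run inside whatever HTTP handler you expose to Alertmanager, with the summary lines forwarded to your chat tool or ticketing system.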

Monitoring with Prometheus Operator

The Prometheus Operator manages a Prometheus-based Kubernetes monitoring stack by implementing the Kubernetes operator pattern: it configures, manages, and optimizes the deployment on a Kubernetes cluster automatically. It acts on custom resource definitions (CRDs), including Prometheus, ServiceMonitor, PrometheusRule, and Alertmanager. As the advantages of using the operator pattern to deploy and configure Prometheus, Grafana, and Alertmanager have become clear, several companies have also packaged the Prometheus Operator to make it easier to deploy and manage, for example:
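As an illustration of how these CRDs are used, a ServiceMonitor tells the operator which Services to scrape, selected by label, and on which named port. The sketch below builds one as a Python dict and prints it as JSON, which kubectl apply -f accepts just like YAML. The application name example-app, its app label, and the port name web are hypothetical. The Prometheus resource's serviceMonitorSelector (and its namespace selector) must also match the ServiceMonitor for it to be picked up.

```python
import json


def service_monitor(name: str, namespace: str, app_label: str,
                    port: str, interval: str = "30s") -> dict:
    """Build a ServiceMonitor manifest (apiVersion monitoring.coreos.com/v1)."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Scrape every Service whose labels match this selector...
            "selector": {"matchLabels": {"app": app_label}},
            # ...on the Service port with this name, at this interval.
            "endpoints": [{"port": port, "interval": interval}],
        },
    }


# Hypothetical application exposing /metrics on a Service port named "web".
manifest = service_monitor("example-app", "default", "example-app", "web")
print(json.dumps(manifest, indent=2))
```

Saving the printed JSON to a file and running kubectl apply -f on that file would create the resource; generating manifests from code like this is mainly useful when many similar services need monitors.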

  • The Prometheus Operator entry on operatorhub.io, originally written by the CoreOS team and now maintained by Red Hat.
  • The loki-stack Helm charts, created by the team at Grafana Labs, which can install the Prometheus Operator along with Promtail and Grafana Loki. This gives you a unified observability option: metrics-based monitoring plus powerful, consolidated, searchable access to the logs of your Kubernetes workloads.

The Prometheus Operator project also maintains the kube-prometheus repository, a collection of Kubernetes manifests, Grafana dashboard templates, and pre-generated Prometheus rules that configure the Prometheus Operator to enable monitoring, observability, and alerting for the Kubernetes cluster itself. kube-prometheus includes the following components in the monitoring stack:

  • The Prometheus Operator
  • Highly available Prometheus
  • Highly available Alertmanager
  • Prometheus node-exporter
  • Prometheus Adapter for Kubernetes Metrics APIs
  • kube-state-metrics
  • Grafana

Setup Steps

Next, we will monitor a Kubernetes cluster with the Prometheus Operator and visualize the monitoring data in Grafana. Before following the steps below, you need an up-and-running Kubernetes cluster and kubectl configured to access it.

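Step 1: Clone the kube-prometheus repository from the prometheus-operator GitHub organization. Each kube-prometheus release supports specific Kubernetes versions; the compatibility matrix in the repository README shows which release branch to check out for your cluster.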

ubuntu@ubuntu:~$ git clone https://github.com/prometheus-operator/kube-prometheus
Receiving objects: 100% (11526/11526), 5.89 MiB | 3.33 MiB/s, done.
Resolving deltas: 100% (7136/7136), done.

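Step 2: Create the monitoring stack from the manifests directory. First, create the resources in manifests/setup: the “monitoring” namespace, the CRDs, and, in the version used here, the Prometheus Operator itself (the output below is abridged). Then wait until the ServiceMonitor CRD is registered; the until loop retries kubectl get servicemonitors every second until it succeeds. Finally, create the rest of the stack from manifests/.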

ubuntu@ubuntu:~$ cd kube-prometheus
ubuntu@ubuntu:~/kube-prometheus$ kubectl create -f manifests/setup
namespace/monitoring created
customresourcedefinition.apiextensions.k8s.io/alertmanagerconfigs.monitoring.coreos.com created
customresourcedefinition.apiextensions.k8s.io/alertmanagers.monitoring.coreos.com created
deployment.apps/prometheus-operator created
service/prometheus-operator created
serviceaccount/prometheus-operator created

ubuntu@ubuntu:~/kube-prometheus$ until kubectl get servicemonitors --all-namespaces ; do date; sleep 1; echo ""; done
No resources found
ubuntu@ubuntu:~/kube-prometheus$ kubectl create -f manifests/
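The until loop above polls forever until kubectl get servicemonitors succeeds, that is, until the ServiceMonitor CRD is registered. If you script the installation, a bounded version of the same retry pattern is safer. The Python sketch below is generic: wait_until and servicemonitor_crd_ready are hypothetical helper names, and the demo uses a stand-in check so it runs without a cluster.

```python
import subprocess
import time
from typing import Callable


def wait_until(check: Callable[[], bool], timeout: float = 120.0,
               interval: float = 1.0) -> bool:
    """Call check() every `interval` seconds until it returns True.

    Returns False instead of looping forever once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def servicemonitor_crd_ready() -> bool:
    """Same test as the shell loop: succeeds once the ServiceMonitor CRD exists."""
    result = subprocess.run(
        ["kubectl", "get", "servicemonitors", "--all-namespaces"],
        capture_output=True,
    )
    return result.returncode == 0


# Stand-in readiness check that succeeds on its third call.
attempts = {"n": 0}


def fake_ready() -> bool:
    attempts["n"] += 1
    return attempts["n"] >= 3


print(wait_until(fake_ready, timeout=10, interval=0.01))
```

Against a real cluster you would call wait_until(servicemonitor_crd_ready, timeout=300) and abort the installation if it returns False.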

Step 3: Check all the resources created in the monitoring namespace. Multiple pods, deployments, services, and daemonsets are now running on the cluster; the listing below is abridged.

ubuntu@ubuntu:~/kube-prometheus$ kubectl get all -n monitoring

NAME                                       READY   STATUS    RESTARTS   AGE
pod/alertmanager-main-0                    2/2     Running   0          3m35s
pod/alertmanager-main-1                    2/2     Running   0          3m35s
pod/grafana-665447c488-9snqs               1/1     Running   0          3m32s
pod/kube-state-metrics-6f4dfb9ffb-g4gb7    3/3     Running   0          3m32s
pod/prometheus-k8s-0                       2/2     Running   1          3m30s
pod/prometheus-k8s-1                       2/2     Running   2          3m30s
pod/prometheus-operator-764cb46c94-jdd28   2/2     Running   0          5m1s

NAME                            TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
service/alertmanager-main       ClusterIP   10.110.145.114   <none>        9093/TCP                     3m36s
service/alertmanager-operated   ClusterIP   None             <none>        9093/TCP,9094/TCP,9094/UDP   3m35s
service/grafana                 ClusterIP   10.102.87.41     <none>        3000/TCP                     3m33s
service/kube-state-metrics      ClusterIP   None             <none>        8443/TCP,9443/TCP            3m33s
service/prometheus-operator     ClusterIP   None             <none>        8443/TCP                     5m2s

NAME                                  READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/grafana               1/1     1            1           3m33s
deployment.apps/kube-state-metrics    1/1     1            1           3m33s
deployment.apps/prometheus-adapter    1/1     1            1           3m31s
deployment.apps/prometheus-operator   1/1     1            1           5m3s
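Rather than eyeballing this output, you can check it programmatically. The Python sketch below parses the pod section of the kubectl get output (the NAME, READY, STATUS, RESTARTS columns shown above) and reports pods that are not fully ready or have restarted; the sample rows are copied from the listing above. For anything beyond a quick check, kubectl get pods -o json is a sturdier input than column text.

```python
def pod_issues(table: str) -> list[str]:
    """Flag pods whose READY count is incomplete, that aren't Running, or that restarted."""
    issues = []
    lines = [line for line in table.strip().splitlines() if line.strip()]
    for line in lines[1:]:  # skip the header row
        name, ready, status, restarts = line.split()[:4]
        up, total = (int(n) for n in ready.split("/"))
        if up < total or status != "Running":
            issues.append(f"{name}: {ready} ready, status {status}")
        elif int(restarts) > 0:
            issues.append(f"{name}: {restarts} restart(s)")
    return issues


sample = """
NAME                                       READY   STATUS    RESTARTS   AGE
pod/alertmanager-main-0                    2/2     Running   0          3m35s
pod/prometheus-k8s-0                       2/2     Running   1          3m30s
pod/prometheus-k8s-1                       2/2     Running   2          3m30s
"""
for issue in pod_issues(sample):
    print(issue)
```

A few restarts right after installation, as in this listing, are common while components wait for each other; restarts that keep climbing are worth investigating.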

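Step 4: If the Kubernetes Dashboard is installed on your cluster, you can see all the namespaces and custom resource definitions present on the cluster there. From the command line, kubectl get namespaces and kubectl get crd show the same information.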

Prometheus namespaces
Prometheus custom resource definitions

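Step 5: Access the Prometheus and Grafana web UIs by port-forwarding their services with the commands below. Prometheus listens on port 9090 and Grafana on port 3000. Each port-forward keeps running in the foreground, so run them in separate terminals, then open http://localhost:9090 and http://localhost:3000 in your browser. The default Grafana login in kube-prometheus is admin/admin.

ubuntu@ubuntu:~/kube-prometheus$ kubectl --namespace monitoring port-forward svc/prometheus-k8s 9090
Forwarding from 127.0.0.1:9090 -> 9090

Prometheus expression

ubuntu@ubuntu:~/kube-prometheus$ kubectl --namespace monitoring port-forward svc/grafana 3000
Forwarding from 127.0.0.1:3000 -> 3000

Welcome to Grafana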
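With the Prometheus port-forward running, the same PromQL you type into the Prometheus UI can also be sent to its HTTP API at /api/v1/query. The Python sketch below builds an instant query for the up metric (1 when the last scrape of a target succeeded, 0 when it failed) and lists the targets that are down. So that the example runs without a cluster, the response is a hard-coded sample shaped like the API's JSON; the job names and addresses in it are illustrative.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS = "http://localhost:9090"  # the Prometheus port-forward from this step


def query_url(promql: str) -> str:
    """URL for an instant query against the Prometheus HTTP API."""
    return f"{PROMETHEUS}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def down_targets(response: dict) -> list[str]:
    """Return job/instance pairs whose `up` value is 0."""
    if response.get("status") != "success":
        raise RuntimeError(f"query failed: {response}")
    down = []
    for series in response["data"]["result"]:
        _timestamp, value = series["value"]  # sample values arrive as strings
        if float(value) == 0:
            labels = series["metric"]
            down.append(f"{labels.get('job')}/{labels.get('instance')}")
    return down


print(query_url("up"))
# Against a live cluster:
#   response = json.load(urllib.request.urlopen(query_url("up")))
sample = json.loads("""
{"status": "success",
 "data": {"resultType": "vector",
          "result": [
            {"metric": {"__name__": "up", "job": "node-exporter", "instance": "10.0.0.5:9100"},
             "value": [1619049600.0, "1"]},
            {"metric": {"__name__": "up", "job": "kube-state-metrics", "instance": "10.0.0.7:8443"},
             "value": [1619049600.0, "0"]}]}}
""")
print(down_targets(sample))
```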

Step 6: Monitor the cluster components and resources using Grafana.

In the Grafana sidebar, open Dashboards and click Manage.

Grafana dashboard

Open the Default folder, which holds the dashboards that kube-prometheus preinstalls for many cluster resources, and choose the dashboard for the resources you want to monitor.

cluster resource monitoring
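The same dashboard list is also available from Grafana's HTTP API: GET /api/search?type=dash-db returns the dashboards along with their titles and folder titles. The Python sketch below builds that request, assuming the Grafana port-forward from Step 5 and Grafana's default admin/admin login, and groups a sample response by folder. The dashboard titles in the sample are illustrative, and since dashboards in the top-level General folder may carry no folderTitle field, the code falls back to "General".

```python
import base64
import json
import urllib.parse
import urllib.request
from collections import defaultdict

GRAFANA_URL = "http://localhost:3000"  # assumption: Grafana port-forwarded locally


def search_request(user: str, password: str) -> urllib.request.Request:
    """Build (but do not send) a request that lists all dashboards."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    query = urllib.parse.urlencode({"type": "dash-db"})
    return urllib.request.Request(
        f"{GRAFANA_URL}/api/search?{query}",
        headers={"Authorization": f"Basic {token}"},
    )


def by_folder(hits: list[dict]) -> dict[str, list[str]]:
    """Group dashboard titles by folder; hits without a folder go under General."""
    folders = defaultdict(list)
    for hit in hits:
        folders[hit.get("folderTitle", "General")].append(hit["title"])
    return dict(folders)


req = search_request("admin", "admin")
print(req.full_url)
# Against a live Grafana: hits = json.load(urllib.request.urlopen(req))
hits = [
    {"title": "Kubernetes / Compute Resources / Cluster", "type": "dash-db", "folderTitle": "Default"},
    {"title": "Kubernetes / Networking / Cluster", "type": "dash-db", "folderTitle": "Default"},
    {"title": "My Team Overview", "type": "dash-db"},
]
print(json.dumps(by_folder(hits), indent=2))
```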

The selected dashboard then visualizes your cluster monitoring data.

This screenshot shows a Grafana dashboard monitoring cluster compute resources such as CPU utilization and memory limits.

cluster compute monitoring on Grafana

This screenshot shows a dashboard monitoring the network bandwidth used on the Kubernetes cluster.

bandwidth monitoring on Grafana
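Dashboard panels like these are PromQL queries evaluated over a time range. For example, cluster-wide receive bandwidth can be expressed as sum(rate(container_network_receive_bytes_total[5m])), using a cAdvisor metric that kube-prometheus scrapes from the kubelet. The Python sketch below builds the matching /api/v1/query_range request for the last hour at 60-second resolution, assuming the Prometheus port-forward from Step 5; the query is an illustrative approximation, not necessarily the exact expression behind the dashboard shown.

```python
import time
import urllib.parse

PROMETHEUS = "http://localhost:9090"  # assumption: Prometheus port-forwarded locally
RX_BANDWIDTH = "sum(rate(container_network_receive_bytes_total[5m]))"


def range_query_url(promql: str, start: float, end: float, step_seconds: int) -> str:
    """URL for a range query: one evaluation every `step_seconds` from start to end."""
    params = {"query": promql, "start": start, "end": end, "step": f"{step_seconds}s"}
    return f"{PROMETHEUS}/api/v1/query_range?" + urllib.parse.urlencode(params)


end = time.time()
url = range_query_url(RX_BANDWIDTH, start=end - 3600, end=end, step_seconds=60)
print(url)
# 3600 s at a 60 s step gives up to 61 points per series (both ends inclusive).
```

A smaller step gives finer graphs at the cost of more points per series; Grafana picks the step automatically from the panel width and the selected time range.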

Conclusion

We hope this article helped you understand the importance of cluster monitoring and how the Prometheus Operator can be a one-stop solution for monitoring your Kubernetes clusters with ease.


Caylent provides a critical DevOps-as-a-Service function to high growth companies looking for expert support with Kubernetes, cloud security, cloud infrastructure, and CI/CD pipelines. Our managed and consulting services are a more cost-effective option than hiring in-house, and we scale as your team and company grow.