Optimizing Prometheus and Grafana with the Prometheus Operator

April 22, 2021

AWS Foundations

Taking a proactive and efficient approach to Kubernetes cluster monitoring helps engineering teams identify and predict critical problems such as CPU exhaustion, memory exhaustion, and storage issues well before they take a toll on the business. Organizations of all sizes monitor their clusters this way; CERN, for example, monitors petabytes of Kubernetes cluster data to understand its cluster workloads. Solving critical problems before they have a significant impact saves money, time, and reputation. Proper cluster monitoring is still a pain point for many companies, though, because it requires knowing exactly what we want to monitor in a cluster.

This article will discuss cluster monitoring fundamentals and how we can use Prometheus Operator to deploy Prometheus and Grafana to monitor a Kubernetes cluster.

What Is Cluster Monitoring?

Cluster monitoring is the process of monitoring all the components and resources running on a cluster. With this process, you actively check the health of all your services and applications and set up monitoring systems that alert administrators as soon as problems occur. We can monitor CPU utilization, memory utilization, the number of namespaces, pods, deployments, and services running on the cluster, and many more resources.

Tools for Cluster Monitoring – Prometheus & Grafana

Prometheus and Grafana are two very popular tool choices for cluster monitoring.

Prometheus is an open-source monitoring system that collects the cluster data by sending HTTP requests to the metrics endpoints of the various resources running on the cluster. Prometheus stores data in a time-series database for analysis and alerting purposes.
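To make the scraping model concrete, here is a minimal Python sketch of what a metrics endpoint returns and how its lines break down into a metric name, labels, and a value. The sample payload and the parse_metrics helper are illustrative assumptions, not part of Prometheus itself, and the parser is deliberately simplified (it does not unescape label values or read timestamps).

```python
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# One label pair inside the braces: name="value".
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse text-format samples into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore anything this simplified parser cannot read
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical payload, shaped like the response of a /metrics endpoint.
EXAMPLE = """\
# HELP http_requests_total Total HTTP requests handled.
# TYPE http_requests_total counter
http_requests_total{method="get",code="200"} 1027
http_requests_total{method="post",code="500"} 3
process_resident_memory_bytes 2.5e+07
"""

samples = parse_metrics(EXAMPLE)
for name, labels, value in samples:
    print(name, labels, value)
```

Each scrape produces one such set of samples, and Prometheus stores every sample with a timestamp, which is what turns these values into time series.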

Prometheus does generate raw visualizations of the metrics it collects. However, these built-in graphs are not necessarily easy to navigate and understand. Using Grafana alongside Prometheus combines the best features of both tools: Grafana integrates with Prometheus seamlessly as a data source and turns the cluster data into clear, readable dashboards.

Business Advantages of Cluster Monitoring

Cluster monitoring is crucial for any organization whose applications run on clusters. Any problem with the cluster can lead to a huge loss to the organization. For example, Moonlight had a 100% traffic outage due to their Kubernetes cluster issues.

Cluster monitoring:

Saves a lot of time and money for the organization by identifying critical issues in the cluster.
Helps in analyzing the cluster performance and measures critical information proactively.
Identifies and helps avoid any upcoming downtime due to bad cluster resource consumption.
Alerts the individual responsible in real-time about the problems in the cluster.
Can prevent or predict any massive issue which can bring down the cluster.
Maintains a proactive health check on all the deployments and services.

Use Cases of Cluster Monitoring

We can curate and visualize cluster data for a better understanding of the cluster by selecting the desired metrics we want to monitor.
Cluster monitoring dashboards are easy to share, so teams can exchange cluster insights with each other.
We can run ad-hoc queries on the cluster monitoring tool to explore the cluster data. We can also explore data across different time ranges, data sources, and queries.
Exploring logs is a fundamental use case for cluster monitoring which every administrator must do daily. We can also explore log metrics to understand data in detail that might not be visible in dashboards.
We can write our own conditions to generate alerts via email, chat tools like Slack, webhooks, etc., for critical cases.
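As an illustration of the ad-hoc query use case, the sketch below builds request URLs for the Prometheus HTTP API, which exposes instant queries at /api/v1/query and range queries at /api/v1/query_range. The base URL, the helper names, and the example PromQL expression are assumptions chosen for illustration; the code only constructs URLs and does not contact a server.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def instant_query_url(base_url, promql, time=None):
    """URL for evaluating a PromQL expression at a single point in time."""
    params = {"query": promql}
    if time is not None:
        params["time"] = time
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode(params)}"


def range_query_url(base_url, promql, start, end, step):
    """URL for evaluating a PromQL expression over a time range."""
    params = {"query": promql, "start": start, "end": end, "step": step}
    return f"{base_url.rstrip('/')}/api/v1/query_range?{urlencode(params)}"


# Assumed example: CPU usage rate per namespace over the last five minutes.
PROMQL = "sum by (namespace) (rate(container_cpu_usage_seconds_total[5m]))"

url = instant_query_url("http://localhost:9090/", PROMQL)
print(url)

# Decode the query string again to confirm the expression survives encoding.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded)
```

The same expression can be pasted into the Prometheus expression browser or used as a Grafana panel query; the range variant is what a dashboard uses when a different time range is selected.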

Monitoring with Prometheus Operator

We can use Prometheus Operator to manage a Prometheus-based Kubernetes monitoring stack through the Kubernetes operator pattern. An operator configures, manages, and optimizes a deployment on a Kubernetes cluster automatically. Prometheus Operator acts on custom resources defined by custom resource definitions (CRDs) such as Prometheus, ServiceMonitor, PrometheusRule, and Alertmanager. As the advantages of using the operator pattern for deploying and configuring Prometheus, Grafana, and Alertmanager have become clear, several teams have packaged Prometheus Operator to make it easier to deploy and manage, for example:

The Prometheus Operator entry on operatorhub.io, originally written by the CoreOS team and now maintained by Red Hat
The loki-stack Helm charts created by the team at Grafana Labs can install the Prometheus Operator along with Promtail and Grafana Loki to give you a unified observability option for metrics-based monitoring as well as powerful consolidated and searchable access to your logs for your Kubernetes workloads.
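To show what one of the operator's custom resources looks like, the following Python sketch generates a ServiceMonitor manifest as JSON, which kubectl accepts in the same way as YAML. The application name, label, and port name are hypothetical. Whether Prometheus actually picks the object up depends on the serviceMonitorSelector configured on the Prometheus resource, which this sketch does not cover.

```python
import json


def service_monitor(name, namespace, match_labels, port_name, interval="30s"):
    """Build a ServiceMonitor that scrapes Services carrying match_labels."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Selects Services (in the same namespace by default) by label.
            "selector": {"matchLabels": match_labels},
            # "port" is the *name* of a port on the selected Service.
            "endpoints": [
                {"port": port_name, "path": "/metrics", "interval": interval}
            ],
        },
    }


# Hypothetical application; replace the names with those of a real Service.
manifest = service_monitor(
    name="example-app",
    namespace="monitoring",
    match_labels={"app": "example-app"},
    port_name="web",
)
print(json.dumps(manifest, indent=2))
```

Saving the printed JSON to a file and running kubectl apply -f on it would create the resource; the operator then translates it into scrape configuration, so Prometheus itself never has to be edited by hand.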

Prometheus Operator also has a kube-prometheus repository, which combines Kubernetes manifests, Grafana dashboard templates, and pre-generated Prometheus rules that configure the Prometheus Operator to enable monitoring, observability, and alerting for the Kubernetes cluster itself. Kube Prometheus consists of the below packages in the monitoring stack:

The Prometheus Operator
Highly available Prometheus
Highly available Alertmanager
Prometheus node-exporter
Prometheus Adapter for Kubernetes Metrics APIs
kube-state-metrics
Grafana
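The pre-generated rules shipped with kube-prometheus are PrometheusRule objects. As a rough sketch of their shape, the Python below builds one custom alerting rule and prints it as JSON. The rule name, threshold, and annotation text are assumptions made for illustration, and the expression relies on node-exporter's memory metrics. Depending on the ruleSelector of the Prometheus resource, extra labels on the object may be needed before the rule is loaded.

```python
import json


def prometheus_rule(name, namespace, alert, expr, duration, severity, summary):
    """Build a PrometheusRule containing a single alerting rule."""
    rule = {
        "alert": alert,
        "expr": expr,          # PromQL condition that must hold
        "for": duration,       # how long it must hold before firing
        "labels": {"severity": severity},
        "annotations": {"summary": summary},
    }
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"groups": [{"name": f"{name}.rules", "rules": [rule]}]},
    }


# Hypothetical alert: a node has used more than 90% of its memory for 10 minutes.
rule_manifest = prometheus_rule(
    name="example-node-memory",
    namespace="monitoring",
    alert="ExampleNodeMemoryHigh",
    expr="(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9",
    duration="10m",
    severity="warning",
    summary="Node memory usage has been above 90% for 10 minutes.",
)
print(json.dumps(rule_manifest, indent=2))
```

Once such a rule fires, Alertmanager takes over routing the notification to email, chat, or a webhook.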

Set-up Steps

Now, we will monitor a Kubernetes cluster with Prometheus Operator and visualize the monitoring components in Grafana. You must have an up and running Kubernetes cluster before following the steps shown below.

Step 1: Clone Kube Prometheus from the Prometheus Operator Git repository.

ubuntu@ubuntu:~$ git clone https://github.com/prometheus-operator/kube-prometheus
Receiving objects: 100% (11526/11526), 5.89 MiB | 3.33 MiB/s, done.
Resolving deltas: 100% (7136/7136), done.

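Step 2: Using the configs present in the manifests directory, create the monitoring stack. The first command creates the “monitoring” namespace, a lot of CRDs, and the operator itself. The loop then waits until the ServiceMonitor CRD is registered, after which the remaining manifests are created.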

ubuntu@ubuntu:~$ cd kube-prometheus
ubuntu@ubuntu:~/kube-prometheus$ kubectl create -f manifests/setup
namespace/monitoring created
customresourcedefinition.apiextensions.k8s.io/alertmanagerconfigs.monitoring.coreos.com created
customresourcedefinition.apiextensions.k8s.io/alertmanagers.monitoring.coreos.com created
deployment.apps/prometheus-operator created
service/prometheus-operator created
serviceaccount/prometheus-operator created

ubuntu@ubuntu:~/kube-prometheus$ until kubectl get servicemonitors --all-namespaces ; do date; sleep 1; echo ""; done
No resources found
ubuntu@ubuntu:~/kube-prometheus$ kubectl create -f manifests/
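The until loop in Step 2 retries forever if the CRDs never become available. A minimal Python sketch of the same wait-until-ready idea with a timeout is shown below. The wait_until and servicemonitors_available helpers are hypothetical names; the demonstration uses a fake check so that it runs without a cluster, and the kubectl-based check is defined but not called.

```python
import subprocess
import time


def wait_until(check, timeout=60.0, interval=1.0):
    """Call check() repeatedly until it returns True or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def servicemonitors_available():
    """True once the API server accepts the same query used in Step 2."""
    result = subprocess.run(
        ["kubectl", "get", "servicemonitors", "--all-namespaces"],
        capture_output=True,
    )
    return result.returncode == 0


# Demonstration with a fake check that succeeds on the third attempt.
attempts = iter([False, False, True])
ready = wait_until(lambda: next(attempts), timeout=5, interval=0)
print("ready:", ready)
```

Against a real cluster, the call would be wait_until(servicemonitors_available, timeout=120), followed by creating the remaining manifests only if it returns True.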

Step 3: Check all the resources created in the monitoring namespace. We can see that multiple pods, daemonsets, and services are now running on the cluster.

ubuntu@ubuntu:~/kube-prometheus$ kubectl get all -n monitoring

NAME                                       READY   STATUS    RESTARTS   AGE
pod/alertmanager-main-0                    2/2     Running   0          3m35s
pod/alertmanager-main-1                    2/2     Running   0          3m35s
pod/grafana-665447c488-9snqs               1/1     Running   0          3m32s
pod/kube-state-metrics-6f4dfb9ffb-g4gb7    3/3     Running   0          3m32s
pod/prometheus-k8s-0                       2/2     Running   1          3m30s
pod/prometheus-k8s-1                       2/2     Running   2          3m30s
pod/prometheus-operator-764cb46c94-jdd28   2/2     Running   0          5m1s

NAME                            TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
service/alertmanager-main       ClusterIP   10.110.145.114   <none>        9093/TCP                     3m36s
service/alertmanager-operated   ClusterIP   None             <none>        9093/TCP,9094/TCP,9094/UDP   3m35s
service/grafana                 ClusterIP   10.102.87.41     <none>        3000/TCP                     3m33s
service/kube-state-metrics      ClusterIP   None             <none>        8443/TCP,9443/TCP            3m33s
service/prometheus-operator     ClusterIP   None             <none>        8443/TCP                     5m2s

NAME                                  READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/grafana               1/1     1            1           3m33s
deployment.apps/kube-state-metrics    1/1     1            1           3m33s
deployment.apps/prometheus-adapter    1/1     1            1           3m31s
deployment.apps/prometheus-operator   1/1     1            1           5m3s
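Reading this output by eye works for a small stack, but the check can also be scripted. The sketch below scans the pod rows of kubectl get all output and flags pods that are not fully ready or that have restarted, such as the two prometheus-k8s pods above. The pods_needing_attention helper is a hypothetical name, and parsing the human-readable table is a simplification; kubectl's JSON output would be a sturdier input for real tooling.

```python
def pods_needing_attention(table):
    """Return (name, reason) for pods that are not fully ready or have restarted."""
    flagged = []
    for line in table.splitlines():
        fields = line.split()
        # Pod rows look like: pod/<name> <ready>/<total> <status> <restarts> <age>
        if len(fields) < 5 or not fields[0].startswith("pod/"):
            continue
        name, ready, status, restarts = fields[:4]
        ready_count, total = ready.split("/")
        if status != "Running" or ready_count != total:
            flagged.append((name, f"status={status} ready={ready}"))
        elif int(restarts) > 0:
            flagged.append((name, f"restarts={restarts}"))
    return flagged


# Rows copied from the output above.
SAMPLE = """\
NAME                                       READY   STATUS    RESTARTS   AGE
pod/alertmanager-main-0                    2/2     Running   0          3m35s
pod/grafana-665447c488-9snqs               1/1     Running   0          3m32s
pod/prometheus-k8s-0                       2/2     Running   1          3m30s
pod/prometheus-k8s-1                       2/2     Running   2          3m30s
"""

flagged = pods_needing_attention(SAMPLE)
for name, reason in flagged:
    print(name, reason)
```

A restart or two shortly after creation is not necessarily a problem, but a count that keeps climbing is worth investigating with kubectl describe and kubectl logs.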

Step 4: If we go to the Kubernetes dashboard, we can see all the namespaces and custom resource definitions present on the cluster.

Step 5: Access the Prometheus and Grafana dashboards using the below commands. Prometheus will be running on port 9090 and Grafana on port 3000.

ubuntu@ubuntu:~/kube-prometheus$ kubectl --namespace monitoring port-forward svc/prometheus-k8s 9090
Forwarding from 127.0.0.1:9090 -> 9090

ubuntu@ubuntu:~/kube-prometheus$ kubectl --namespace monitoring port-forward svc/grafana 3000
Forwarding from 127.0.0.1:3000 -> 3000
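With both port-forwards running, a quick way to confirm that the services respond is to call their health endpoints: Prometheus serves /-/healthy and Grafana serves /api/health. The Python sketch below assumes the local ports used above; the is_healthy and check_all helpers are hypothetical names, and nothing is contacted until check_all() is called.

```python
from urllib.request import urlopen

# Local addresses opened by the two port-forward commands above.
HEALTH_URLS = {
    "prometheus": "http://127.0.0.1:9090/-/healthy",
    "grafana": "http://127.0.0.1:3000/api/health",
}


def is_healthy(url, timeout=2.0, opener=urlopen):
    """True if the endpoint answers with HTTP 200, False on any network error."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:  # covers URLError, HTTPError, refused connections, timeouts
        return False


def check_all(urls=HEALTH_URLS):
    """Map each service name to its health status."""
    return {name: is_healthy(url) for name, url in urls.items()}
```

Calling check_all() from a second terminal should return True for both services while the port-forwards are active; the opener parameter exists only so the function can be exercised without a network.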

Step 6: Monitor the cluster components and resources using Grafana.

In Grafana, open the Dashboards menu and click on Manage.

Select the Default folder to see the many cluster dashboards available. Choose the resources you want to monitor.

Finally, your cluster monitoring visualization will be ready.

For example, a Grafana dashboard can monitor the cluster compute resources such as CPU utilization, memory limits, etc.

Conclusion

We hope this article helped you in understanding the importance of cluster monitoring and how Prometheus Operator can be the one-stop solution necessary to monitor your Kubernetes clusters with ease.

Caylent provides a critical DevOps-as-a-Service function to high growth companies looking for expert support with Kubernetes, cloud security, cloud infrastructure, and CI/CD pipelines. Our managed and consulting services are a more cost-effective option than hiring in-house, and we scale as your team and company grow. Check out some of the use cases, learn how we work with clients, and read more about our DevOps-as-a-Service offering.

Caylent Team