I have been playing with Grafana, InfluxDB, Telegraf and Prometheus for the last two weeks and can now share the story of what I finally chose to replace my Netdata instances, and why.
For many years, I used Munin for monitoring my servers. It served me well, but it was outdated, and you could only collect the most basic metrics unless you wrote your own plugins. Finding a replacement was hard, but I liked the idea of Netdata, since it was basically plug and play.
But then, shortly after, they pushed you more and more towards their cloud-based solution. I didn't want that, neither from a cost perspective, nor did I want to store my server data in the cloud. Later, they increasingly limited the number of servers you can view on the dashboard. And while the limit of 5 should have been sufficient for me, it wasn't, due to a buggy implementation (I think). My quick workaround was to deploy multiple Netdata instances while searching for a replacement.
This was when I started experimenting with Grafana as the dashboard, InfluxDB as the database and Telegraf for data collection (which I always want to call Telefax, I don't know why).
At first, I was very satisfied, since the Telegraf collector is also plug and play for me. Creating dashboards in Grafana was not that hard either, since InfluxDB can be queried with SQL-like statements.
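As a rough illustration of what those SQL-like queries look like: this is the kind of InfluxQL statement a Grafana panel might issue against the `cpu` measurement that Telegraf's default CPU input plugin writes (`$timeFilter` and `$__interval` are Grafana's own template macros; the exact query depends on your setup).

```sql
-- Sketch of a Grafana panel query against Telegraf's "cpu" measurement.
-- Field ("usage_idle") and tag ("cpu") names follow Telegraf's default
-- cpu input plugin; adjust to your own schema.
SELECT mean("usage_idle")
FROM "cpu"
WHERE "cpu" = 'cpu-total' AND $timeFilter
GROUP BY time($__interval) fill(null)
```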
It would have been so easy. I had already created multiple dashboards for different services, added alerts and disabled Netdata completely, when I noticed unusual CPU spikes on the system running this monitoring stack. Unfortunately, every time Grafana requested data from InfluxDB (or InfluxDB pushed it; I didn't check), the latter used an unreasonable amount of CPU.
So I checked out Prometheus, which uses a different concept for collecting data: different exporters for different types of data, which all need to be set up individually. The effort was higher, but it was worth the hassle. Prometheus uses next to no CPU and is currently running with ~100 MiB of RAM, where InfluxDB needed 1.6 GiB for the same amount of data.
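To give an idea of how Prometheus ties those exporters together: the server scrapes each exporter over HTTP, configured centrally in `prometheus.yml`. A minimal sketch for scraping node_exporter (which listens on port 9100 by default) could look like this; the job name and hostnames are placeholders.

```yaml
# Minimal sketch of a Prometheus scrape config.
# Targets are placeholders — point them at your own servers.
scrape_configs:
  - job_name: node
    static_configs:
      - targets:
          - "server1.example.com:9100"   # node_exporter default port
          - "server2.example.com:9100"
```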

An additional plus: since I needed to set up the exporters on multiple systems, I could play with Ansible again, configuring everything once and then rolling it out automatically to every server.
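A rollout like that can be sketched as a short playbook. This assumes a Debian-based distro that packages the exporter as `prometheus-node-exporter`; package and service names will differ on other systems.

```yaml
# Sketch: install and enable node_exporter on every host in the inventory.
# Assumes a Debian/Ubuntu system — adjust package/service names as needed.
- hosts: all
  become: true
  tasks:
    - name: Install node_exporter
      ansible.builtin.apt:
        name: prometheus-node-exporter
        state: present

    - name: Ensure node_exporter is running and enabled at boot
      ansible.builtin.service:
        name: prometheus-node-exporter
        state: started
        enabled: true
```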