System monitoring for a new generation with Prometheus

Big Watcher

Lead Image © Kurhan, 123RF.com

Article from Issue 186/2016

Legacy monitoring solutions are fine for small-to-medium-sized networks, but complex environments benefit from a different approach. Prometheus is an interesting alternative to classic tools like Nagios.

Where monitoring is required, alerting and trending are never far away. Alerting plays a major role in practically any monitoring environment; the idea is to draw the administrator's attention to failures. Trending is just as important: It helps the admin detect potential bottlenecks at an early stage.

A quick look at the available monitoring solutions shows why Monitoring, Alerting, and Trending (MAT) are still an issue for many networks, particularly large and complex networks. Nagios, which has dominated the monitoring market for a long time, is a behemoth of complexity and comes with some inherent weaknesses.

Nagios alternatives such as Icinga have attempted to address some of the issues, but their scalability is limited, and the ballast of compatibility with Nagios and its plugins aggravates the situation. A state-of-the-art feature like trending was never really designed into legacy Nagios. PNP4Nagios [1], a performance-tracking Nagios add-on, is one of the few options for useful trending with Nagios (Figure 1).

Figure 1: Legacy solutions such as PNP4Nagios generate heavy load when computing graphs and still take a huge amount of time – especially if you need to map longer periods of time.

SoundCloud as the Precursor

UK-based SoundCloud was confronted with the challenge of implementing a monitoring solution. The company operates a streaming service along the lines of Spotify or Apple Music. The real challenge from the outset was to build a MAT system that would work reliably with thousands of nodes. Instead of combining existing components to create a better-than-nothing solution, SoundCloud decided to explore unknown territory: The company chose to develop its own monitoring system, and the result was Prometheus [2].

Compared with established solutions like Nagios, Prometheus has one very special feature: It comes with its own storage system to manage the data acquired from the network. Prometheus' internal database is based on the concept of the time series database, and Prometheus tends to think in terms of complete metrics rather than focusing on individual alerts. To understand what that means, I will take a short detour into the storage universe.

How MAT Systems Manage Data

Classic monitoring systems, such as Nagios, do not have very sophisticated data management, and they don't actually need it. The important thing with monitoring is whether a service is running properly right now. When you add the topic of trending, things start to become more difficult: Trending means you need long-term records relating to the availability of the service or the load on the existing infrastructure.

PNP4Nagios, for example, supports a database such as MySQL in the background in order to store the required values for a long period. MySQL is actually not designed for this kind of use, which can lead to problems. The volume of data you need to manage will grow extremely quickly in any large installation. The persistent storage on which all your trending data resides thus needs to scale just as easily as the entire platform. This is particularly true of the storage, but it also applies to the way in which the database handles a continuously increasing volume of data.

Also, preparing the data is a challenge: The data reaches the MAT system sorted in order of time, but at the other end, you'll need to output the data to reflect specific services. For example, the MAT system is regularly supplied with data points from its target systems for various services in consecutive order, such as "9AM: CPU load 1, RAM utilization 30 percent, and disk space usage 15 percent." However, administrators will typically want to know what the CPU load looked like in a specific period, for example between 9AM today and the same time the previous morning.

Storing and manipulating large amounts of data in a database is an extremely resource-hungry process, and MySQL, in particular, loves taking its time with queries from tools like PNP4Nagios. A time-series database, such as the database used with Prometheus, offers an alternative approach.

Basically, a time series database is no more than a database designed to store data in temporal relation. (See the box titled "Not the First, But the Best.") The data is processed by algorithms directly in the database. Prometheus is thus better equipped to take on a complex task such as trending thanks to its data model.
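To make this per-metric view tangible, the following minimal Go sketch fetches one day of data for a single metric through Prometheus's HTTP API (the query_range endpoint). The server address localhost:9090, the metric name node_load1 (as exposed by the node exporter), and the five-minute step are assumptions you would adjust to your own environment.

package main

import (
    "fmt"
    "io"
    "net/http"
    "net/url"
    "time"
)

func main() {
    // Assumptions: a Prometheus server on localhost:9090 and a node
    // exporter metric called node_load1; adjust both to your setup.
    end := time.Now()
    start := end.Add(-24 * time.Hour)

    params := url.Values{}
    params.Set("query", "node_load1")                    // the metric to read
    params.Set("start", fmt.Sprintf("%d", start.Unix())) // 24 hours ago
    params.Set("end", fmt.Sprintf("%d", end.Unix()))     // now
    params.Set("step", "300")                            // one data point every 5 minutes

    // query_range returns every matching series with its timestamp/value
    // pairs for the requested window as JSON.
    resp, err := http.Get("http://localhost:9090/api/v1/query_range?" + params.Encode())
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    body, err := io.ReadAll(resp.Body)
    if err != nil {
        panic(err)
    }
    fmt.Println(string(body))
}

The response contains every matching series with its timestamp/value pairs for the requested window, which is exactly the view an administrator wants when looking at trends.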

Not the First, But the Best

Prometheus is not the first attempt to apply the time-series database model to network monitoring. Graphite [3] was around long before Prometheus, but its data model is not as mature. InfluxDB [4], which is typically combined with a frontend such as Sensu, is even younger than Prometheus, but it addresses a different user group and, according to our tests, doesn't scale as well as Prometheus when faced with large volumes of data. And then there is OpenTSDB [5], the Open Time Series Database, which is fundamentally very similar to Prometheus but requires external add-on components such as Hadoop. The fact that these external constraints do not apply to Prometheus is something that many admins really appreciate about the product.

Typical monitoring and alerting then becomes little more than a byproduct: If no results are received for a specific metric over a period of time, the system assumes the service is not running correctly and sounds the alarm.

Prometheus Modular Architecture

Under the hood, Prometheus relies on a modular architecture. The core of the application – that is, the time series database – is programmed in Go, just like most of the applications in the Prometheus distribution. The database comes with its own web interface and a separate tool for alert management (the Alert Manager). Exporters for the target host are important – exporter is basically another word for agent: The node exporter, for example, logs various data for metrics such as CPU load or RAM usage on the host on which it is running, giving the Prometheus database the ability to pull this data when needed. If the service needs to push its data to the MAT system, you can deploy the push gateway, which fields the data from the services and stages it for the database.
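If the ready-made exporters do not cover a service, you can write your own with the official Go client library. The following sketch assumes the github.com/prometheus/client_golang packages; the metric name, the port, and the simulated reading are placeholders for whatever your service actually measures.

package main

import (
    "log"
    "math/rand"
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
    // A gauge for an arbitrary reading; the metric name and help text
    // are placeholders for whatever your own exporter measures.
    temperature := prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "demo_room_temperature_celsius",
        Help: "Current room temperature (simulated).",
    })
    prometheus.MustRegister(temperature)

    // Update the gauge in the background. A real exporter would read the
    // value from hardware, a logfile, or an application API instead.
    go func() {
        for {
            temperature.Set(20 + 5*rand.Float64())
            time.Sleep(10 * time.Second)
        }
    }()

    // Expose all registered metrics on /metrics so that the Prometheus
    // server can pull (scrape) them at its configured interval.
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":9123", nil))
}

Once the exporter is configured as a scrape target, the Prometheus server simply pulls its /metrics page at the regular scrape interval, just like it does for the node exporter.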

At the heart of the system is the Prometheus server (Figure 2). The server handles many tasks, the most important of which is storing the measurement data acquired in the cloud. Although Prometheus comes from the cloud camp, the service lags behind in terms of scalability: You can easily run any number of Prometheus instances within the same setup, but in contrast to many other solutions, Prometheus does not rely on shared storage on the back end.

Figure 2: The hub of the Prometheus system is the Prometheus server, which communicates with all the other components.

The Prometheus developers cite complexity as a reason for avoiding shared storage. They mention their competitor OpenTSDB as a negative example. Many admins would love to deploy OpenTSDB, but they are put off by the enormous overhead of running a complete Hadoop cluster.

Instead, Prometheus relies on the sharding principle: You can configure multiple instances of the Prometheus server to cover overlapping data areas. Before performing a search, the database determines the shard in which the data in question must reside and only looks there.
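The principle is easiest to see in a small, purely conceptual Go sketch: Hash a series identifier and take the result modulo the number of shards to decide which instance is responsible. This illustrates the idea only; it is not how Prometheus itself assigns data internally.

package main

import (
    "fmt"
    "hash/fnv"
)

// shardFor maps a series identifier (metric name plus labels) to one of n
// shards. Purely illustrative; not Prometheus's internal logic.
func shardFor(seriesID string, n uint32) uint32 {
    h := fnv.New32a()
    h.Write([]byte(seriesID))
    return h.Sum32() % n
}

func main() {
    const shards = 3
    series := []string{
        `node_load1{instance="web01:9100"}`,
        `node_load1{instance="web02:9100"}`,
        `node_load1{instance="db01:9100"}`,
    }
    for _, id := range series {
        fmt.Printf("%-40s -> shard %d\n", id, shardFor(id, shards))
    }
}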

At this level, you can replicate by letting logical pairs of servers collect the data from the same agents on the network. A record is thus available multiple times and remains usable if one of the two nodes fails.

The Prometheus developers are aware that there is a problem with this lack of a shared storage alternative. Right now, they are working on a solution that generates a superordinate instance for a cluster of Prometheus installations; the instance, in turn, picks up the data from the Prometheus shards.

This approach gives users centralized administration. And there are plans for the distant future: In the long term, the intent is for Prometheus to store data in OpenTSDB – and thus leverage its replication capabilities.
