Monitor resource contention with Pressure Stall Information

Memory and I/O

The two other files, memory and io, each contain two lines. The first line starts with some; the second with full. The some values show the portion of time in which at least one process is stalled, and the full values show the portion of time in which all non-idle processes are stalled simultaneously. According to the documentation at the Kernel.org site, the full state means that "…actual CPU cycles are going to waste, and the workload that spends extended time in this state is considered to be thrashing." Listing 3 shows an example from a two-socket compute node with AMD EPYC 7551 processors and a total of 128 threads.

Listing 3

Measuring with memory and io

$ grep -R . /proc/pressure/
/proc/pressure/io:some avg10=0.00 avg60=0.00 avg300=0.00 total=10587199096
/proc/pressure/io:full avg10=0.00 avg60=0.00 avg300=0.00 total=10072568253
/proc/pressure/cpu:some avg10=30.27 avg60=29.97 avg300=18.80 total=1620253162
/proc/pressure/memory:some avg10=0.00 avg60=0.00 avg300=0.00 total=15411
/proc/pressure/memory:full avg10=0.00 avg60=0.00 avg300=0.00 total=12389
$ uptime
07:24:59 up 2 days, 16:15,  1 user,  load average: 150.58, 118.00, 76.42
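Because every line follows the same pattern, the values are easy to pick apart in a program. The following is a minimal sketch, not part of the kernel documentation: the struct psi_line and the function parse_psi_line() are hypothetical names chosen for this example. The function expects a line as read directly from one of the pressure files, without the file name prefix that grep -R adds in Listing 3.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical container for one line of a /proc/pressure file. */
struct psi_line {
    char kind[8];                 /* "some" or "full" */
    double avg10, avg60, avg300;  /* stall time in percent */
    unsigned long long total;     /* cumulative stall counter */
};

/* Parse one line; returns 0 on success, -1 if the line does not match. */
int parse_psi_line(const char *line, struct psi_line *out)
{
    int n = sscanf(line, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
                   out->kind, &out->avg10, &out->avg60, &out->avg300,
                   &out->total);

    if (n != 5)
        return -1;
    if (strcmp(out->kind, "some") != 0 && strcmp(out->kind, "full") != 0)
        return -1;
    return 0;
}
```

A caller would read each line of, say, /proc/pressure/memory with fgets() and hand it to parse_psi_line(); a return value of -1 signals a line that does not have the expected layout.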

A large full value for memory can mean that, for much of the measured period, not a single runnable process was able to make progress and that the CPU was probably busy paging. The overloaded backup server in Listing 4 illustrates this nicely: Logging onto this system with SSH took more than a minute.

Listing 4

Overloaded Backup Server

$ grep -R . /proc/pressure/
/proc/pressure/io:some avg10=15.60 avg60=11.13 avg300=7.98 total=94192093351
/proc/pressure/io:full avg10=15.60 avg60=11.13 avg300=7.97 total=93713900789
/proc/pressure/cpu:some avg10=0.00 avg60=0.00 avg300=0.00 total=1159442298
/proc/pressure/memory:some avg10=67.79 avg60=67.80 avg300=72.51 total=618948360599
/proc/pressure/memory:full avg10=67.60 avg60=67.58 avg300=72.18 total=613900281165
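The avg10, avg60, and avg300 fields are averaged by the kernel over fixed windows. If you need a different interval, one option is to read the total field twice and compare the readings. The sketch below assumes, following the kernel documentation, that total counts the accumulated stall time in microseconds; the function name psi_stall_percent() is hypothetical.

```c
#include <stdint.h>

/*
 * Share of a sampling interval spent stalled, in percent.
 * total_before and total_after are two readings of the total= field
 * (assumed to be in microseconds); interval_us is the time between
 * the two readings in microseconds.
 * Returns -1.0 if the input cannot be used.
 */
double psi_stall_percent(uint64_t total_before, uint64_t total_after,
                         uint64_t interval_us)
{
    if (interval_us == 0 || total_after < total_before)
        return -1.0;
    return 100.0 * (double)(total_after - total_before)
                 / (double)interval_us;
}
```

Two readings taken one second apart whose total values differ by 500,000 would yield 50 percent, which means that stalls occurred during half of that second.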

Polling

The Linux PSI interface lets admins register triggers by writing them to the pressure files and then waiting for the resulting events with poll(). Listing 5 breaks down the syntax; the stall amount and the time window are both specified in microseconds.

Listing 5

Polling Syntax

some|full Stall_Amount Time_Window
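Because both values are given in microseconds, a small helper can prevent unit mistakes. The following sketch builds a trigger string from values in milliseconds. The function name psi_format_trigger() is hypothetical. The window limits of 500 milliseconds to 10 seconds are taken from the kernel documentation; the remaining checks are this example's own sanity checks.

```c
#include <stdio.h>
#include <string.h>

/*
 * Write "<kind> <stall_us> <window_us>" into buf.
 * Returns the length of the string, or -1 on invalid input
 * or if the buffer is too small.
 */
int psi_format_trigger(char *buf, size_t size, const char *kind,
                       unsigned long stall_ms, unsigned long window_ms)
{
    int n;

    if (strcmp(kind, "some") != 0 && strcmp(kind, "full") != 0)
        return -1;
    /* Per the kernel documentation: windows of 500ms to 10s only. */
    if (window_ms < 500 || window_ms > 10000)
        return -1;
    /* Sanity check: the threshold must fit into the window. */
    if (stall_ms == 0 || stall_ms > window_ms)
        return -1;

    /* Convert milliseconds to the microseconds the interface expects. */
    n = snprintf(buf, size, "%s %lu %lu", kind,
                 stall_ms * 1000UL, window_ms * 1000UL);
    if (n < 0 || (size_t)n >= size)
        return -1;
    return n;
}
```

Calling psi_format_trigger(buf, sizeof(buf), "some", 150, 1000) fills buf with the string some 150000 1000000, that is, a 150-millisecond threshold within a one-second window.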

Listing 6 shows an example of a monitoring program from the Linux documentation [4]. The program registers a trigger that sends a notification whenever processes stall waiting for memory for more than 150 milliseconds in total within a one-second window. If you name the file, say, psi_example.c, you can build it easily by typing make psi_example, assuming you have the build tools in place.

Listing 6

psi_example.c

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <poll.h>
#include <string.h>
#include <unistd.h>
/*
 * Monitor memory partial stall with 1s tracking
 * window size and 150ms threshold.
 */
int main() {
  const char trig[] = "some 150000 1000000";
  struct pollfd fds;
  int n;
  fds.fd = open("/proc/pressure/memory",
                 O_RDWR | O_NONBLOCK);
  if (fds.fd < 0) {
    printf("/proc/pressure/memory open error: %s\n",
            strerror(errno));
    return 1;
  }
  fds.events = POLLPRI;
  if (write(fds.fd, trig, strlen(trig) + 1) < 0) {
    printf("/proc/pressure/memory write error: %s\n",
            strerror(errno));
    return 1;
  }
  printf("waiting for events...\n");
  while (1) {
    n = poll(&fds, 1, -1);
    if (n < 0) {
      printf("poll error: %s\n", strerror(errno));
      return 1;
    }
    if (fds.revents & POLLERR) {
      printf("got POLLERR, event source is gone\n");
      return 0;
    }
    if (fds.revents & POLLPRI) {
      printf("event triggered!\n");
    } else {
      printf("unknown event received: 0x%x\n",
              fds.revents);
      return 1;
    }
  }
  return 0;
}

Conclusions

PSI condenses information about resource bottlenecks into just one or two lines per resource [5]. The file-based interface makes it easy to integrate into scripts and helps to build monitoring systems. Even external system monitoring tools such as Atop already integrate PSI (Figure 2).

Figure 2: Version 2.4.0 and newer versions of Atop also show the PSI.

Thanks to the integration of PSI in Cgroups, admins receive this information both globally for the entire system and in a more granular form for individual control groups. PSI provides admins with a powerful alternative to the load average for a better overview of resource bottlenecks.
