Facebook releases its own OOM implementation
Contract Killer

Lead Image © efks, 123RF.com
When a Linux system runs out of memory, a special agent, the out-of-memory killer, rushes to its aid. Facebook has now introduced its own OOM killer. What makes it different from its kernel-based counterpart? And what is an OOM killer really?
If you have not placed an order for a large server for a long time, you will probably rub your eyes in amazement the next time you order a new device: Configurations with terabytes instead of gigabytes of RAM are easy to get, and you don't need to be a millionaire to buy them. Gone are the days when people were proud of every single gigabyte (Figure 1).
Some buyers don't even worry about RAM anymore and just assume the system will have enough; however, this might be a little too optimistic, even on a modern system. Servers still sometimes come up short on RAM, and when they do, it can have dramatic consequences: If a component such as systemd needs RAM and cannot allocate it, the system will malfunction or stop working. To keep a RAM shortage from bringing computers to their knees, the Linux kernel has a watchdog on board: the out-of-memory killer, or OOM killer for short. In an emergency, the OOM killer frees up memory by deliberately terminating selected processes; the memory is then available for other, presumably more important purposes.
Many legends and horror stories are centered on the OOM killer, and the admin's sense of humor is typically strained when they see kernel messages in the log saying that the killer has struck again (Figure 2). The reason for the anxiety is that the OOM killer tends to pick large, memory-hungry applications, such as Java processes, as its victims.

Java is not famed for being very sparing with resources, but it is usually necessary for running the application for which the server exists. If the OOM killer shoots down Java on a Tomcat system, a load balancer usually catches the problem, but the server taken out in this way is still gone at the end of the day.
This article introduces the current OOM implementation in Linux and explains how it works. I will then compare this standard implementation with an alternative approach chosen by Facebook.
How OOM Situations Occur
Even servers with huge amounts of RAM can get into situations where the available system RAM is not sufficient. This is because the Linux kernel uses certain ways and means to allocate memory as efficiently as possible. If you have ever called top and looked at the RAM statistics, you will be aware that even on systems with a large amount of RAM and very little load, the display for RAM utilization is often close to the 100 percent limit (Figure 3).
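To see how far "not free" and "not available" can drift apart, you can compare a few fields from /proc/meminfo. The following is a minimal sketch, not a replacement for top: the function names parse_meminfo and memory_summary are made up for this illustration, and the sketch assumes the usual "Name: value kB" layout of that file, including the MemAvailable field found on current kernels.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (mostly kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that do not look like "Name: value"
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def memory_summary(info):
    """Contrast memory that is merely 'not free' with memory that is truly unavailable."""
    total = info["MemTotal"]
    free = info["MemFree"]
    # Fall back to MemFree if the kernel does not report MemAvailable.
    available = info.get("MemAvailable", free)
    return {
        "not_free_pct": round(100 * (total - free) / total, 1),
        "unavailable_pct": round(100 * (total - available) / total, 1),
    }
```

On a Linux host, passing the contents of /proc/meminfo to parse_meminfo() and the result to memory_summary() will typically show a high not_free_pct next to a much lower unavailable_pct, because memory the kernel uses for caches can be reclaimed when programs need it.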

The Linux kernel is the interface between the hardware on one side and the programs on the other. If a program wants memory, it typically calls a library function like malloc(), which in turn requests memory from the kernel through system calls such as brk() or mmap(). However, it would take too long for the kernel to first search for free memory on every request and only then make the requested amount available.
Instead, the kernel plans ahead: It divides the entire available memory into segments, known as memory pages. In addition, the kernel keeps track of which pages are already assigned to running programs and which are still available. If a program now comes along and requests RAM, the kernel simply assigns it memory pages from the list of free pages. Because Linux supports more than one page size (regular pages as well as much larger huge pages), the kernel also has a certain degree of flexibility and can ensure that there is not too much waste.
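The relationship between request size, page size, and waste comes down to simple arithmetic. The sketch below is purely illustrative: pages_needed and wasted_bytes are hypothetical helpers, and real allocations pass through the C library's allocator, which subdivides pages further, so the numbers describe page granularity only.

```python
import mmap

# Base page size of the machine running the script, commonly 4096 bytes.
PAGE_SIZE = mmap.PAGESIZE


def pages_needed(request_bytes, page_size=PAGE_SIZE):
    """Number of whole pages required to back a request of request_bytes."""
    if request_bytes < 0:
        raise ValueError("request_bytes must not be negative")
    return -(-request_bytes // page_size)  # ceiling division


def wasted_bytes(request_bytes, page_size=PAGE_SIZE):
    """Bytes left unused in the last page of a request."""
    return pages_needed(request_bytes, page_size) * page_size - request_bytes
```

For example, a request of 5,000 bytes occupies two 4KB pages and leaves 3,192 bytes unused, whereas the same request backed by a single 2MB huge page would leave almost the entire page unused. This is why the choice of page size matters for waste.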
Waste Is Bad
It is important to avoid waste to the greatest extent possible. Even if you have an arbitrary amount of RAM at your disposal, you will still want to use it as well and efficiently as possible. For many years, the Linux kernel has supported a function that many admins consider equivalent to opening up the proverbial Pandora's box – overbooking RAM.
Roughly speaking, it works like this: The kernel assigns memory pages to requesting programs as usual, but in total it promises more memory than the physical RAM could actually provide. This does not directly cause OOM problems – they are caused by programs that require too much RAM.
However, RAM overcommitment increases the risk of OOM situations because the kernel does not rigorously deal with potential difficulties in advance. If Linux did not allow applications to allocate more memory than actually exists, crashes due to a lack of memory would be unthinkable because applications would simply see an error message when they tried to claim more memory than available.
The Linux approach is different. The kernel speculates that allocated memory will never be fully used. The vm.overcommit_memory sysctl variable manages everything else: If it is set to 0, which is the default value, the kernel uses a heuristic approach to calculate how much RAM is actually free. It then sets this in relation to the memory that a requesting application wants to have. If the calculations are positive, the program gets the memory, even if the amount of allocated memory becomes larger than the actual memory available in the system.
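vm.overcommit_memory=1 makes the kernel even more radical: In this case, the kernel skips the heuristic analysis and approves every request for RAM. But if you set the value to 2, RAM overbooking is switched off: The kernel then refuses any request that would push the total amount of allocated memory beyond a fixed commit limit.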
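The three settings can be summarized in a small model. This is a deliberately simplified sketch, not the kernel's actual logic: the function names are invented for this example, the mode 0 branch reduces the heuristic to "refuse only requests larger than RAM plus swap", and the mode 2 limit assumes the default vm.overcommit_ratio of 50 and ignores reserves and huge pages.

```python
OVERCOMMIT_MODES = {
    0: "heuristic: refuse only obviously excessive requests (default)",
    1: "always: approve every request",
    2: "never: enforce a fixed commit limit",
}


def describe_overcommit(mode):
    """Human-readable summary of a vm.overcommit_memory value."""
    return OVERCOMMIT_MODES.get(mode, "unknown mode")


def commit_limit_kb(ram_kb, swap_kb, overcommit_ratio=50):
    """Approximate commit limit in mode 2: swap plus a percentage of RAM."""
    return swap_kb + ram_kb * overcommit_ratio // 100


def request_allowed(mode, committed_kb, request_kb, ram_kb, swap_kb,
                    overcommit_ratio=50):
    """Toy decision model for a single allocation request (all sizes in kB)."""
    if mode == 1:
        return True  # every request is approved
    if mode == 2:
        limit = commit_limit_kb(ram_kb, swap_kb, overcommit_ratio)
        return committed_kb + request_kb <= limit
    if mode == 0:
        # Simplification: only a single request that could never fit is refused.
        return request_kb <= ram_kb + swap_kb
    raise ValueError("unknown overcommit mode")
```

On a live system, the current setting can be read from /proc/sys/vm/overcommit_memory, and the CommitLimit and Committed_AS fields in /proc/meminfo show the limit and the amount of memory currently promised to programs.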
What Really Helps
If, on the basis of the previous explanations, you think that deactivating RAM overbooking is sufficient, you are wrong. The OOM problem is not caused by overbooking RAM, but by programs that continuously allocate too much RAM. And unfortunately, they usually do this unpredictably and for a variety of reasons. Often the root of the problem is simply a programming error, which causes the affected program to overburden the RAM. Occasionally, a system genuinely needs more RAM than is available to process incoming requests.
If you are confronted with OOM situations, you should first try very carefully to find the cause. If the emergency is not based on a programming error and the OOM situations occur regularly and reproducibly, the long-term solution can only be more hardware. You can either put more RAM into the affected servers or scale the setup horizontally.
If you are dealing with a programming error, it is a good idea to find it and repair it – in collaboration with the developers if necessary. Troubleshooting in such cases can be tough and time consuming. But if OOM problems occur after an update where there were none before, a bug is most likely the trigger.
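A first step in finding the cause is to establish which processes the OOM killer has hit and how often. The following sketch scans kernel log text for kill messages. It assumes that the log lines contain the phrase "Killed process <pid> (<name>)", as current kernels print it; the exact wording can differ between kernel versions, and the helper names are hypothetical.

```python
import re
from collections import Counter

# Assumed message fragment, for example:
#   Out of memory: Killed process 4242 (java) total-vm:...kB, anon-rss:...kB
KILL_RE = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(log_text):
    """Return a list of (pid, process_name) tuples found in kernel log text."""
    kills = []
    for line in log_text.splitlines():
        match = KILL_RE.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills


def victims_by_name(kills):
    """Count how often each process name was killed."""
    return Counter(name for _pid, name in kills)
```

The input can come from the output of dmesg or journalctl -k. If the same program tops the count again and again, especially after an update, that points to where a closer look is warranted.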