Facebook releases its own OOM implementation
Contract Killer
When a Linux system runs out of memory, a special agent, the out-of-memory killer, rushes to its aid. Facebook has now introduced its own OOM killer. What makes it different from its kernel-based counterpart? And what is an OOM killer really?
If you have not placed an order for a large server for a long time, you will probably rub your eyes in amazement the next time you order a new device: Configurations with terabytes instead of gigabytes of RAM are easy to get, and you don't need to be a millionaire to buy them. Gone are the days when people were proud of every single gigabyte (Figure 1).
Some buyers don't even worry about RAM anymore and just assume the system will have enough; however, this might be a little too optimistic, even on a modern system. Servers still sometimes come up short on RAM, and when they do, it can have dramatic consequences: If a component such as systemd needs RAM and cannot allocate it, the system will malfunction or stop working. To keep a RAM shortage from bringing computers to their knees, the Linux kernel has a watchdog on board: the out-of-memory killer, or OOM killer for short. In an emergency, the OOM killer frees up memory by killing selected processes; the memory is then available for other, presumably more important purposes.
Many legends and horror stories are centered on the OOM killer, and an admin's sense of humor is typically strained when kernel messages in the log report that the killer has struck again (Figure 2). The reason for the anxiety is that the OOM killer tends to pick large applications, such as Java applications, as its victims.
Java is not famed for being very sparing with resources, but it is usually necessary for running the application for which the server exists. If the OOM killer shoots down Java on a Tomcat system, a load balancer usually catches the problem, but the server taken out in this way is still gone at the end of the day.
This article introduces the current OOM implementation in Linux and explains how it works. I will then compare this standard implementation with an alternative approach chosen by Facebook.
How OOM Situations Occur
Even servers with huge amounts of RAM can get into situations where the available system RAM is not sufficient. This is because the Linux kernel uses certain ways and means to allocate memory as efficiently as possible. If you have ever called top and looked at the RAM statistics, you will be aware that even on systems with a large amount of RAM and very little load, the display for RAM utilization is often close to the 100 percent limit (Figure 3).
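One common reason for the high utilization figure is that Linux uses otherwise idle RAM for caches, which it can give back when programs need memory. The following Python sketch compares the two views by parsing text in the format of /proc/meminfo. The helper names and the sample values are made up for illustration, and the sketch assumes a kernel that reports the MemAvailable field.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> KiB."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def used_percent(fields, count_cache_as_used):
    """Percentage of RAM in use, with or without reclaimable memory."""
    total = fields["MemTotal"]
    # MemFree ignores caches; MemAvailable is the kernel's estimate of
    # what new programs could still get without swapping.
    free = fields["MemFree"] if count_cache_as_used else fields["MemAvailable"]
    return 100.0 * (total - free) / total


# Made-up sample values for illustration, not measurements.
SAMPLE = """\
MemTotal:       16000000 kB
MemFree:          400000 kB
MemAvailable:   12000000 kB
Cached:         11000000 kB
"""

if __name__ == "__main__":
    text = SAMPLE
    if os.path.exists("/proc/meminfo"):
        with open("/proc/meminfo") as handle:
            live = handle.read()
        if "MemAvailable" in parse_meminfo(live):
            text = live  # use the live system when possible
    info = parse_meminfo(text)
    print("used, cache counted:     %.1f%%" % used_percent(info, True))
    print("used, cache not counted: %.1f%%" % used_percent(info, False))
```

On a lightly loaded machine, the first figure is typically much higher than the second, which is why a nearly full RAM display alone does not indicate an imminent OOM situation.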
The Linux kernel is the interface between the hardware on one side and the programs on the other. If a program wants memory, it asks for it, typically through a library function like malloc(), which in turn obtains memory from the kernel through system calls such as brk() or mmap(). However, it would take too long if the kernel first had to search for free memory on every request and only then make the requested amount available.
Instead, the kernel plans ahead: It divides the entire available memory into segments, known as memory pages. In addition, the kernel keeps track of which pages are already assigned to running programs and which are still available. If a program now comes along and needs RAM, the kernel simply assigns it memory pages from the list of free pages. Because Linux supports more than one page size (huge pages in addition to the regular base pages), the kernel also has a certain degree of flexibility and can ensure that there is not too much waste.
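To make the page bookkeeping a little more concrete, the following Python sketch computes how many pages a request of a given size occupies and how many bytes of the last page remain unused. The function names are hypothetical, and the sketch only does the arithmetic; it does not show how the kernel manages its free lists.

```python
import os


def pages_needed(request_bytes, page_size):
    """Number of whole pages required to back request_bytes."""
    if request_bytes < 0 or page_size <= 0:
        raise ValueError("need a non-negative size and a positive page size")
    return -(-request_bytes // page_size)  # ceiling division


def slack_bytes(request_bytes, page_size):
    """Bytes left unused in the last page of the request."""
    return pages_needed(request_bytes, page_size) * page_size - request_bytes


if __name__ == "__main__":
    # Ask the system for its base page size; fall back to 4096 bytes
    # (an assumption) where os.sysconf is not available.
    page = os.sysconf("SC_PAGE_SIZE") if hasattr(os, "sysconf") else 4096
    for size in (1, page, page + 1, 10 * 1024 * 1024):
        print(size, "bytes ->", pages_needed(size, page), "pages,",
              slack_bytes(size, page), "bytes unused")
```

The larger the page, the more bytes a small request can leave unused, which is one reason why a choice of page sizes helps limit waste.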
Waste Is Bad
It is important to avoid waste to the greatest extent possible. Even if you have an arbitrary amount of RAM at your disposal, you will still want to use it as efficiently as possible. For many years, the Linux kernel has supported a function that many admins consider equivalent to opening up the proverbial Pandora's box – overbooking RAM.
Roughly speaking, it works like this: The kernel grants memory requests as usual, but it promises more memory in total than the physical RAM can actually provide. This does not directly cause OOM problems; they are caused by programs that actually use too much RAM.
However, RAM overcommitment increases the risk of OOM situations because the kernel does not rigorously deal with potential difficulties in advance. If Linux did not allow applications to allocate more memory than actually exists, crashes due to a lack of memory would be unthinkable because applications would simply see an error message when they tried to claim more memory than available.
The Linux approach is different. The kernel speculates that allocated memory will never be fully used. The vm.overcommit_memory sysctl variable manages everything else: If it is set to 0, which is the default value, the kernel uses a heuristic approach to calculate how much RAM is actually free. It then sets this in relation to the memory that a requesting application wants to have. If the calculations are positive, the program gets the memory, even if the amount of allocated memory becomes larger than the actual memory available in the system.
vm.overcommit_memory=1 makes the kernel even more radical: In this case, the kernel skips the heuristic analysis and approves every request for RAM. But if you set the value to 2, RAM overbooking is switched off.
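The three settings can be summarized in a short Python sketch that reads the current mode and, for mode 2, estimates the limit up to which the kernel still grants requests. The commit limit formula is not taken from this article; it follows the kernel's overcommit accounting documentation (swap plus a percentage of RAM set by vm.overcommit_ratio, 50 by default) and does not apply if vm.overcommit_kbytes is set. The function names are hypothetical.

```python
import os

# Meaning of the three vm.overcommit_memory values.
OVERCOMMIT_MODES = {
    0: "heuristic overcommit (default)",
    1: "always overcommit, approve every request",
    2: "strict accounting, refuse requests beyond the commit limit",
}


def describe_mode(raw):
    """Translate the content of the sysctl file into a description."""
    return OVERCOMMIT_MODES.get(int(raw.strip()), "unknown mode")


def commit_limit_kib(ram_kib, swap_kib, overcommit_ratio=50, hugetlb_kib=0):
    """Approximate commit limit in mode 2, in KiB.

    Assumes vm.overcommit_kbytes is not set; huge page memory is
    excluded from the RAM share, as in the kernel documentation.
    """
    return (ram_kib - hugetlb_kib) * overcommit_ratio // 100 + swap_kib


if __name__ == "__main__":
    path = "/proc/sys/vm/overcommit_memory"
    if os.path.exists(path):
        with open(path) as handle:
            print("current mode:", describe_mode(handle.read()))
    # Example figures (assumptions): 16 GiB of RAM and 2 GiB of swap.
    print("commit limit:", commit_limit_kib(16 * 1024**2, 2 * 1024**2), "KiB")
```

With the default ratio, a machine in mode 2 refuses allocations long before physical RAM is exhausted unless swap is generous, so the ratio usually needs to be tuned along with the mode.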
What Really Helps
If the previous explanations have led you to think that deactivating RAM overbooking is sufficient, you are wrong. The OOM problem is not caused by overbooking RAM, but by programs that continuously allocate too much RAM. And unfortunately, they usually do this unpredictably and for a variety of reasons. Often the root of the problem is simply a programming error, which causes the affected program to consume far more RAM than intended. Occasionally, a system really does need more RAM than is available to process incoming requests.
If you are confronted with OOM situations, you should first try very carefully to find the cause. If the emergency is not based on a programming error and the OOM situations occur regularly and reproducibly, the long-term solution can only be more hardware. You can either put more RAM into the affected servers or scale the setup horizontally.
If you are dealing with a programming error, it is a good idea to find it and repair it – in collaboration with the developers if necessary. Troubleshooting in such cases can be tough and time consuming. But if OOM problems occur after an update where there were none before, a bug is most likely the trigger.
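A first step in finding the cause is to see which processes the OOM killer has chosen and how often. The following Python sketch counts victims in kernel log text, for example the output of dmesg or journalctl -k. The exact wording of the kernel message differs between kernel versions, so the pattern and the sample lines below are assumptions for illustration rather than a reliable parser.

```python
import re
from collections import Counter

# Assumed message fragment: "Killed process <pid> (<name>)".
KILL_RE = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_victims(log_text):
    """Return a list of (pid, process name) for each kill message found."""
    victims = []
    for line in log_text.splitlines():
        match = KILL_RE.search(line)
        if match:
            victims.append((int(match.group(1)), match.group(2)))
    return victims


def count_by_name(victims):
    """Count how often each process name was killed."""
    return Counter(name for _pid, name in victims)


# Made-up log lines for illustration only.
SAMPLE_LOG = """\
[ 1001.000000] usb 1-1: new high-speed USB device number 2 using xhci_hcd
[ 2002.000000] Out of memory: Killed process 4242 (java) total-vm:8388608kB, anon-rss:6291456kB, file-rss:0kB
[ 3003.000000] Out of memory: Killed process 5151 (mysqld) total-vm:4194304kB, anon-rss:3145728kB, file-rss:0kB
"""

if __name__ == "__main__":
    for name, count in count_by_name(find_oom_victims(SAMPLE_LOG)).most_common():
        print(name, "was killed", count, "time(s)")
```

If the same program shows up again and again, and especially if it only started to appear after an update, that points toward a bug in that program rather than a general lack of hardware.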