Stress testing for temperature on a home NAS

Temperature Watch

© Photo by Ilse Orsel on Unsplash

Article from Issue 254/2022
Author(s): Adam Dix

Using stress, lm-sensors, and hddtemp to sort out temperature- and reliability-related issues with a home-based NAS box.

I recently found myself in a difficult situation with my home NAS brought about by some sketchy construction work done at my apartment. Long story short, the workers didn't mention that they would be using my work area as their work area and set about cutting bathroom tiles right next to my main workstation, my home NAS server, and a toy i3-based rig that I occasionally use for testing and special projects. Consequently, my NAS suffered failure after failure – with ghosts that I am still chasing more than three months later.

My NAS is a home-built rig using a Supermicro X8STi with an X5680, 24GB of ECC DDR3, and six disks total, including the operating system (OS) drive itself. The OS is on an SSD, and the data is all on HDDs. There is no redundancy (see my article on MergerFS elsewhere in this issue). So, to be clear, this is a mess primarily of my own making, but one not without others at fault.

The first problem that I experienced after the aforementioned construction was that my disks would randomly disappear after having been present at boot. I cleaned the machine out as best as I could and reseated the SATA cables. This worked temporarily, but two of the drives would still drop off, requiring a reboot and, in some cases, requiring me to reseat the SATA cable. I should note that, at the time, I was using old inexpensive SATA cables that had been collected over the years from motherboard purchases, so I decided it was time to switch to more modern cables with locking clips. The problems remained.

At this point, having opened the box numerous times, I realized that the intake fan in front of the hard drive cage had failed. The fan was covered in red dust from the construction. Though I cannot say definitively that this was what caused its demise, it was followed very quickly by the CPU fan. Both were 120mm fans running full-tilt 24/7, and neither had filters or screens in front of them. The next item to go was the power button, oddly enough. Although I cannot assume that dust directly caused that failure, the power and reset buttons certainly got some additional mileage after the construction was completed as I chased these demons down. The case was an inexpensive affair made of cheap plastic and jagged steel, which seemed to take its final shape only once fully populated with gear, but I hadn't had any issues with it before the construction.

At this point, I decided that a cursory cleaning wasn't enough, and I really needed to rip everything out, clean everything off well, rewire it all, replace the faulty fans, and test. Needless to say, it is good that this is only a home NAS box and not something used in any sort of real production, as this is exactly the type of scenario that is avoided at all costs in the server world.

After taking everything out and dusting it all as carefully as possible, it would have been sacrilege for me not to replace the CPU Thermal Interface Material (TIM) between the CPU lid and the heat sink. The X5680 is a 130W CPU and lets you know it when under load as its temperature ramps up quickly.

At this point in my epic saga, I had a freshly rebuilt box with known working components but without a feel for the performance and temperature with the new fans installed. I decided to use three simple and commonly used programs to address these needs:

  • stress [1] (which is also available in an enhanced version known as stress-ng) is a workload generator that you can use to apply a configurable load to your system.
  • hddtemp [2] is a tool that will monitor the temperature of your hard drive to make sure the drive is operating in the recommended range. hddtemp works by accessing the Self-Monitoring, Analysis, and Reporting Technology (S.M.A.R.T.) information available with many hard drives.
  • lm-sensors [3] is a utility that can read and report data from sensors located in the hardware, including sensors for monitoring temperature, fan speed, and voltage for the CPU, mainboard, and other components.

These tools have all been around for a while, and you might already be familiar with them. What I decided to do was to run stress on all 12 threads of the CPU while at the same time having Plex add all of my preexisting media to the Plex library. This test is really the most use that those drives will ever get at one time. With the CPU working on all threads, the test would simulate not only the maximum power draw but also allow me to see how well the cheap 120mm knock-off fans perform.

Simple Stress Test

To install (on Ubuntu-based systems such as mine) and to make sure that you have the most recent versions of the programs, run the following commands in the terminal (Figure 1):

$ sudo apt update && sudo apt install hddtemp lm-sensors stress -y
$ sudo apt update && sudo apt upgrade -y
Figure 1: Installing stress testing and monitoring tools.

My system was running Ubuntu [4] 20.04.3 LTS and was up to date at the time of stress testing. The install only takes a moment, and from there, all operations are done from the terminal, so you can use Ctrl+L to clear the screen and leave the terminal open to start the stress test. I would also recommend opening a second tab or another terminal instance for the monitoring. If you are connecting to a headless server via SSH or using some other terminal emulator, such as the one found in Webmin [5] or Cockpit [6], you will need to adjust your workflow accordingly. At any rate, I found it easier to have two terminal windows open: one to run stress (and to stop it with Ctrl+C) and the other to rerun the monitoring commands as needed.

To start running stress, I used the following command to load up all of the threads on the CPU:

$ stress --cpu 12

The output appears in Figure 2. Once the test had been running for a few minutes, I ran the following command in the second terminal window in order to see where the temps were:

$ sudo sensors && sudo hddtemp /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf
Figure 2: Running stress on all 12 threads of a hexa-core CPU with hyperthreading.

Figure 3 shows the result. You might need to run the following command first so that lm-sensors knows which sensors exist and which of them it can read:

$ sudo sensors-detect
Figure 3: Monitoring for temperatures with hddtemp and lm-sensors.
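
When you only care about the hottest core, the sensors output can be boiled down to a single number. The following is a minimal sketch rather than part of the original test: it assumes the CPU readings are labeled "Core N:" (as the coretemp driver typically does on Intel CPUs), and the function name hottest_core is made up for illustration.

```shell
#!/usr/bin/env bash
# Print the highest "Core N" temperature found in sensors-style text on stdin.
# Assumes lines shaped like: Core 0:        +45.0°C  (high = +81.0°C, ...)
hottest_core() {
  awk '
    /^Core [0-9]+:/ {
      temp = $3              # third field is the reading, e.g. "+45.0°C"
      sub(/^\+/, "", temp)   # drop the leading plus sign
      temp += 0              # numeric conversion discards the "°C" suffix
      if (!seen || temp > max) { max = temp; seen = 1 }
    }
    END { if (seen) printf "%.1f\n", max }
  '
}
```

With the function loaded in the current shell, sensors | hottest_core prints one value per run, which is easier to compare against a limit such as the CPU's boost ceiling than the full report. If the labels on your board differ, the pattern in the first line of the awk program needs adjusting.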

I don't know why I had to spell out each of the drives individually for hddtemp to work. Simply typing hddtemp should list the temperatures of all drives using the default Celsius scale, with each drive appearing on its own line, but that didn't work for me. Adding the /dev/sdX for each drive after the command did work and displayed each drive on its own line with temps shown in Celsius. Perhaps this had something to do with one drive being an SSD and the rest being HDDs, though I doubt that was the cause. Perhaps it was because I use a cheap "RAID" card (it is not really a RAID card but rather an inexpensive Marvell chip with some SATA ports connected to it), which connects via PCIe and allows for four additional SATA III drives to be installed in the system.
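
Typing six device paths by hand gets tedious, and the shell can build the list instead. The sketch below is one possible approach, not something from the original setup: the helper name list_whole_disks is hypothetical, and it assumes that the drives appear as /dev/sdX (NVMe devices, for instance, are named differently and are not covered).

```shell
#!/usr/bin/env bash
# Print whole-disk device paths (sda, sdb, ...) found in a directory, one per line.
# Partitions such as sda1 are skipped because their names end in a digit.
list_whole_disks() {
  local dir="${1:-/dev}"
  local path
  for path in "$dir"/sd*; do
    [ -e "$path" ] || continue          # the glob matched nothing
    case "${path##*/}" in
      *[0-9]) continue ;;               # partition, not a whole disk
    esac
    printf '%s\n' "$path"
  done
}
```

With the function defined, sudo hddtemp $(list_whole_disks) passes every whole disk to hddtemp. On a box where the drives are known to be sda through sdf, the shorter sudo hddtemp /dev/sd[a-f] achieves the same thing with a plain glob.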

For about an hour, I pressed the up arrow and reran the monitoring command every so often to make sure that the temps were OK. My CPU will run without a problem and boost normally up to around 81°C; during testing, mine peaked at about 75°C. Typical home-use HDDs, such as the WD Blues that I was using, shouldn't exceed 35-40°C, and mine hovered around 31°C during testing. Some folks will do stress testing for hours and even for a day or more, but this is a home NAS with non-critical data stored on it in a temperature-controlled environment. The heat sink for the CPU was thoroughly heat-soaked after about 10 minutes of operation, and the drives are located directly behind the previously mentioned 120mm intake fan, meaning they were receiving fresh (bathroom tile dust-free) air for the test.
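
Checking the readings by eye works, but a small filter can flag drives that run hot. This is a sketch under assumptions, not the method used in the test above: it expects hddtemp's usual "device: model: temperature" line format, the function name check_drive_temps is invented for illustration, and the default limit of 40 is simply the upper end of the range mentioned above.

```shell
#!/usr/bin/env bash
# Read hddtemp-style lines ("/dev/sda: MODEL: 31°C") on stdin and label each
# drive "ok" or "WARN" depending on whether it exceeds the limit in degrees C.
check_drive_temps() {
  local limit="${1:-40}"
  awk -F': ' -v limit="$limit" '
    {
      temp = $NF                      # last field holds the reading, e.g. "31°C"
      sub(/[^0-9].*$/, "", temp)      # keep only the leading digits
      if (temp == "") next            # no numeric reading on this line
      status = (temp + 0 > limit + 0) ? "WARN" : "ok"
      printf "%s %s %d\n", status, $1, temp
    }
  '
}
```

A typical call would be sudo hddtemp /dev/sd[a-f] | check_drive_temps 40. To avoid pressing the up arrow for an hour, the plain monitoring command can be wrapped in watch, for example sudo watch -n 60 'sensors; hddtemp /dev/sd[a-f]', which reruns it every 60 seconds. Similarly, stress accepts a --timeout option, so stress --cpu 12 --timeout 1h ends the load on its own instead of waiting for Ctrl+C.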

Conclusion

After my experience with the recent construction in my home, I can pass on one very important lesson: make sure that your devices are covered and turned off if someone is doing construction nearby. Dust from construction is not great when run through your server, workstation, or laptop.

Running stress-testing software with a monitoring program is a great way to make sure that your system will stand the test of time and that your components will run optimally, especially after a new build or a rebuild. It is important to have a good baseline for your system to know if and when it is running well. One of the best things that I have found in the open source community is that there is a program, app, script, or Flatpak for just about anything you can imagine. With all of the issues that I have had to deal with pertaining to this NAS server, software was never an issue. Handy utilities like stress, hddtemp, and lm-sensors were easy to use and gave me important insights on my system.

It is easy to forget about the folks who have supported the open source community over the years, especially when you find yourself in situations that you would rather not be in and with hardware that seemed to have been cursed, but I can say, unequivocally, that I have far more faith in the contributors unknown to me who helped to make my NAS system work than I do in the construction workers I let into my house. For that, I would just like to say – very loudly, so that you can hear me over these cheap fans – thank you!

The Author

Adam Dix is a mechanical engineer and Linux enthusiast posing as an English teacher after playing around a bit in sales and marketing. You can check out some of his Linux work at EdUBudgie Linux (https://www.edubudgie.com).
