Your NAS isn't enough – you still need to back up your data!

Not All NAS

Lead Image © bram janssens, 123RF.com

Some users trust their data to powerful file servers that advertise enterprise data protection, but your Network Attached Storage system might not be as safe as you think it is.

There is a point in the life of a compulsive data hoarder when a regular computer is no longer enough to contain a burgeoning file collection. As the collection keeps growing, the first step a home user typically takes to extend the storage capacity is to purchase an external USB hard drive. The hard drive will buy the user some time, but eventually this solution falls short. A sufficiently dedicated data hoarder will sooner or later have to invest in a Network Attached Storage (NAS) server.

A NAS is a dedicated server optimized for storing large amounts of information. NAS servers are commonly available as commercial appliances, but many power users prefer to build their own from spare parts. Serious NAS servers are scalable, letting you increase their capacity by adding hard drives as needed. Better yet, they often offer enterprise features that come in very handy and promise to mitigate the most common threats to the long-term survival of your files.

NAS vendors often advertise fault tolerance and profess their systems' immunity to disaster, which leads users to treat this sort of storage as bulletproof, dumping their data onto it and skipping the step of making backups. But these consumer-grade storage systems rarely provide a complete solution. This article describes some of the things that can go wrong – and why you still need to perform backups to ensure that your data is safe.

The Features of a Quality NAS

A wide range of NAS options are available for home users. These options vary in quality from desktop toys to quasi-enterprise systems trying to pass as domestic appliances (Figure 1).

Figure 1: The TrueNAS Mini E is a popular NAS appliance. It features 8GB of ECC RAM (which is upgradeable to 16GB) and four hot-swap bays for hard drives.

With the exception of the low-end models, NAS boxes are designed to offer the highest possible availability. In this context, a high-availability machine is one that can keep serving its users under adverse conditions. Such a server needs to keep functioning if a hard drive fails, if the power grid blacks out, or if its power supply malfunctions.

Servers mitigate hard drive failures with a Redundant Array of Independent Disks (RAID). A RAID group is just a set of hard drives that the operating system recognizes as a single virtual drive. (See the box entitled "Popular RAID Levels" for more information on some common RAID scenarios.) In a domestic NAS context, these drives will most often be grouped at the so-called RAID 5 level. RAID 5 distributes the data evenly across every device in the array, along with some extra parity information. Should one of the drives fail, the server keeps functioning in a degraded state, running on the remaining drives and using the parity data to reconstruct the lost information.
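To see why parity makes this reconstruction possible, consider the following Python sketch. It is not how a real RAID controller operates – real implementations rotate the parity across the drives and work at the block device level – but the arithmetic is the same: The parity block is the XOR of the data blocks, so any single missing block can be recomputed from the survivors.

def xor_blocks(blocks):
    # XOR the blocks together byte by byte
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

# Three "drives" worth of data plus one parity block
d1, d2, d3 = b"AAAA", b"BBBB", b"CCCC"
parity = xor_blocks([d1, d2, d3])

# Drive 2 dies; its contents are recovered from the survivors and the parity
recovered = xor_blocks([d1, d3, parity])
print(recovered == d2)   # -> True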

Popular RAID Levels

RAIDs can be built in multiple ways, depending on the purpose they serve. The most popular traditional RAID levels are:

  • RAID 0 stripes data across all the drives in the set for increased performance (Figure 2). The total size of the RAID is the sum of the sizes of the individual drives. A single disk failure kills the array, making it a dangerous RAID level to use. RAID 0 has better read and write throughput than a single hard drive of the same size as the array, because the workload is evenly distributed over the individual drives in the RAID.
Figure 2: RAID 0 distributes the data across the drives of the array. This configuration is good for performance, but losing a single drive destroys the whole array.
  • RAID 1 mirrors the data across all the drives in the array (Figure 3). Since every drive has a full copy of all the data, a RAID 1 can keep working as long as one of its drives is still operational. RAID 1 is good for maintaining uptime, but it is not very cost-effective, because, at the very least, it takes twice as many drives for the same storage capacity.
Figure 3: RAID 1 ensures that the data is mirrored from one drive to the other. As long as there is a functioning drive, the array will keep working, but this configuration is not cost effective.
  • RAID 5 is among the most popular in small deployments. This form of RAID is known as disk striping with parity. The disks are striped (as with RAID 0), but one drive's worth of capacity is devoted to parity information, distributed across all the drives, ensuring that the array can keep working if one of the drives fails (Figure 4). RAID 6 does pretty much the same thing, except it can keep working after two hard drive failures.
Figure 4: In a RAID 5 configuration, data is distributed evenly across all the drives of the array, alongside a small amount of parity information, in such a way that the server hosting the array may keep functioning if one of the drives fails.
  • RAID 10 is a combination of RAID 0 and RAID 1. Drives are deployed in pairs in which each unit mirrors the other; all the pairs are then striped together as a RAID 0 (Figure 5). RAID 10 can keep functioning as long as at least one drive in each pair is in working order.
Figure 5: RAID 10 places RAID 1 pairs within a RAID 0. This configuration is very fault tolerant but also very expensive.

A server can survive blackouts with the help of an Uninterruptible Power Supply (UPS), which is just a fancy term for a battery that kicks in when the power grid goes down (Figure 6). A modern UPS can communicate with the server over USB or Ethernet to let the operating system know how much power is left in the battery, which is useful for forcing the machine to shut down in an orderly way when the supply is about to run dry.

Figure 6: File servers are often paired with an Uninterruptible Power Supply system, such as this CyberPower unit. This device will prevent an unclean shutdown in case of a blackout.
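On Linux, this communication is typically handled by a package such as Network UPS Tools (NUT), whose upsmon daemon performs the shutdown for you when the battery runs low. The following Python sketch only illustrates the underlying idea, assuming NUT is installed and a UPS has been configured under the hypothetical name myups:

import subprocess, time

THRESHOLD = 20   # shut down once the battery charge drops below 20 percent

def battery_charge():
    # "upsc myups battery.charge" prints the remaining charge as a bare number
    out = subprocess.run(["upsc", "myups", "battery.charge"],
                         capture_output=True, text=True, check=True)
    return int(out.stdout.strip())

while True:
    if battery_charge() < THRESHOLD:
        # trigger an orderly shutdown instead of an abrupt power loss (needs root)
        subprocess.run(["shutdown", "-h", "now"])
        break
    time.sleep(30)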

About ECC

Good NAS hardware will often feature Error Correction Code (ECC) RAM. ECC RAM is capable of checking itself for consistency against random memory errors, which are more frequent than you might think [1]. RAM errors are considered dangerous for the survival of a dataset and the continued operation of a server: A botched bit in RAM could cause the operating system to malfunction or a file to become corrupted. ECC is intended to reduce the risk of such an event and keep the system running after a memory error.
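The principle behind ECC is that of classic error-correcting codes: A few extra parity bits are stored alongside the data so that a single flipped bit can be located and repaired. The following Python sketch implements a toy Hamming(7,4) code purely for illustration; real ECC modules use wider codes implemented in hardware, typically correcting single-bit errors and detecting double-bit errors.

def encode(d):
    # d = [d1, d2, d3, d4]; codeword layout (1-indexed): p1 p2 d1 p3 d2 d3 d4
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    return [p1, p2, d[0], p3, d[1], d[2], d[3]]

def correct(c):
    # each syndrome bit re-checks one parity group; together they point
    # at the position of a single flipped bit (0 means no error)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s3 * 4 + s2 * 2 + s1
    if pos:
        c[pos - 1] ^= 1               # flip the bad bit back
    return [c[2], c[4], c[5], c[6]]   # the recovered data bits

word = encode([1, 0, 1, 1])
word[4] ^= 1                          # simulate a random bit flip in RAM
print(correct(word))                  # -> [1, 0, 1, 1]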

A theory holds that a bit error in RAM could cause a chain reaction, resulting in massive data corruption within a ZFS filesystem. It is therefore argued that the only safe way of running a ZFS server is with ECC RAM, and that doing otherwise is borderline suicidal.

ZFS has no pre-mount consistency checker and, at the time of this writing, lacks filesystem repair tools. ZFS was conceived as a self-healing filesystem, capable of repairing data corruption on the go. Should ZFS try to read a data block that has been corrupted by, say, a hard drive defect, the filesystem is able to identify the issue and attempt to repair it on the fly from redundant data. Such self-healing features do, in theory, eliminate the need for recovery tools. The FreeNAS project (now TrueNAS) used to warn that a botched memory operation could cause permanent damage to the filesystem, and since there are no recovery tools available, data could end up being unrecoverable [2].
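The following Python sketch illustrates the general idea of self-healing on read. It is a simplification, not ZFS's actual design: Each block is stored with a checksum and a redundant copy, and a read that fails the checksum is transparently repaired from the copy that still matches.

import hashlib

def checksum(data):
    return hashlib.sha256(bytes(data)).digest()

class MirroredBlock:
    def __init__(self, data):
        self.copies = [bytearray(data), bytearray(data)]
        self.good_sum = checksum(data)

    def read(self):
        for copy in self.copies:
            if checksum(copy) == self.good_sum:
                # heal any copy that no longer matches the checksum
                for other in self.copies:
                    if checksum(other) != self.good_sum:
                        other[:] = copy
                return bytes(copy)
        raise IOError("all copies corrupt - nothing left to heal from")

block = MirroredBlock(b"important data")
block.copies[0][0] ^= 0xFF   # simulate silent corruption on one disk
print(block.read())          # -> b'important data', and copy 0 is repaired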

However, opinions differ on whether ZFS is more susceptible to failure than other filesystems. Matthew Ahrens, cofounder of Sun's ZFS project, argues that using ZFS with non-ECC RAM is about as risky as running a regular filesystem without it [3]: In his view, ECC RAM is not a requirement, but it is highly recommended.

RAID Issues

A good NAS promises excellent uptime and looks indestructible on the surface. It would seem that files should be able to survive indefinitely on such a server. After all, if a NAS is capable of withstanding a hard drive failure (the most common hardware malfunction [4]), there is little incentive to spend the considerable amount of money required to set up another server and keep a backup of the original one.

The problem is that there is only so much a file server can do to protect your data, especially outside of an enterprise environment. Quality server hardware is designed to guarantee good uptime in the face of trouble, but not necessarily the integrity of your information. There are a number of reasons why a NAS may still fail.

If a hard drive fails within a NAS's RAID 5 set, the whole array will work at a degraded level. From the user's viewpoint, the array is still operational, but it has ceased to offer fault tolerance. Should another drive fail before a new one is added and the array is rebuilt, the information contained in the array will be lost. Many a RAID array has failed due to owner procrastination – or due to the long wait for the attention of an overworked sysadmin.

But tardy repair is just one of the reasons why some experts are wary of depending on RAID. A casual search on the Internet will find countless opinions regarding the unsuitability of RAID 5 for modern file servers [5]. Storage media is not perfect and may suffer random read failures. Hard drives are reliable enough for most purposes [6], but every now and then they will throw an Unrecoverable Read Error (URE). UREs are errors that occur when the hard drive tries to access a block of data and fails to do so. Modern drives are commonly rated at one URE for every 10^14 bits read on average, which means errors are rare.

The bigger a disk array, the higher the chance that a defective sector exists somewhere. The argument of RAID 5 detractors is that disk arrays are becoming so big that the probability of triggering a URE is becoming too high to be acceptable. This is so because the more bits are managed by the RAID, the more likely it is that at least one block of information is problematic.

If a RAID 5 loses a drive to hardware failure, a new drive can be plugged in, and the array can be rebuilt from the data on the remaining disks. However, if any of the remaining disks throws a URE during this process, the consequences may range from losing the data in that sector to being unable to rebuild the whole RAID (depending on the quality of the RAID controller and drives).
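The arithmetic behind this worry is easy to reproduce. Assuming the quoted rate of one URE per 10^14 bits and treating read errors as independent events, the following Python snippet estimates the chance of hitting at least one URE while re-reading the surviving drives of a degraded array:

URE_RATE = 1e-14   # assumed probability of an unrecoverable error per bit read

def rebuild_risk(surviving_drives, drive_tb):
    bits = surviving_drives * drive_tb * 1e12 * 8   # bits that must be re-read
    return 1 - (1 - URE_RATE) ** bits

# Rebuilding a 4 x 4TB RAID 5 means re-reading the three surviving drives:
print(f"{rebuild_risk(3, 4):.0%}")   # roughly 62% with these assumptions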

Experience suggests that the fear of being unable to rebuild big arrays is blown out of proportion. Nevertheless, it is important to remember that RAID 5 is a tool for guaranteeing uptime rather than the integrity of your files.

There are RAID levels with better fault tolerance than RAID 5 (such as RAID 6 or RAID 10), but using these alternatives in a small system is comparatively expensive.

Nearly as bad is the fact that many RAID controllers are proprietary and don't offer a good migration path. If you are using a proprietary solution and want to move your hard drives from an old server to a new one – maybe because the old one finally bit the dust! – you might discover that your data is unreadable on the destination machine.

On the other hand, software issues might destroy your files just as quickly as a hardware-level malfunction, and using an enterprise-grade server won't do much for you if you are hit by a bug. For example, QNAP's NAS appliances were widely affected by a vulnerability that left many users preyed on by the DeadBolt ransomware [7][8].

Power Failure

Modern filesystems are moderately resistant to power failure, but even the mighty ZFS can suffer from a blackout [9]. A UPS will help, but beware of cheap units: Many budget domestic UPSs are not prepared to handle continuous operation and will wear out, eventually bringing down the NAS with them. According to a 2016 Ponemon Institute survey, UPS failure is the top cause of unplanned data center outages [10]. What this means in practice is that blackout protection reduces the risk of suffering data loss from a power cut, but it does not remove the threat entirely.

In enterprise scenarios, administrators are aware that trying to make a single NAS bulletproof is not enough to guarantee true high availability. In practice, the enterprise uses Storage Area Networks (SANs) or distributed filesystems such as Ceph [11]. Such tools are deployed across computer clusters, in such a way that if a server goes down, the rest of the cluster remains operational.

The minimal (and, for serious purposes, insufficient) storage cluster that can be deployed is shown in Figure 7. This is known as a Primary-Replica topology, in which the primary serves the clients, while the replica's contents are periodically synchronized with the primary's. Should the primary go down, the load balancer promotes the replica and turns it into the new primary (Figure 8).

Figure 7: A naive high-availability cluster. A load balancer directs all traffic to a file server designated as the primary. The file server designated as a replica contains a copy of the primary's contents.
Figure 8: If the primary server goes offline, the replica is promoted to primary and the traffic is transferred to it.
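A toy version of the promotion logic might look like the following Python sketch. The health check is just a TCP probe against the NFS port, and promote() is a hypothetical placeholder for whatever mechanism actually redirects the traffic; real load balancers and cluster managers are far more careful than this.

import socket, time

servers = {"primary": "192.0.2.10", "replica": "192.0.2.11"}

def is_alive(host, port=2049, timeout=3):
    # crude health check: can we open a TCP connection to the NFS port?
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def promote(host):
    # placeholder: a real setup would enable writes on the replica and
    # point the clients (or a virtual IP) at it
    print(f"promoting {host} to primary")

while True:
    if not is_alive(servers["primary"]):
        promote(servers["replica"])
        servers["primary"], servers["replica"] = (
            servers["replica"], servers["primary"])
    time.sleep(10)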

The Cloud Option

Real-life high-availability systems are not something you are likely to run at home: They typically feature redundant load balancers and might have some Border Gateway Protocol (BGP) magic thrown in. Even the naive and simple method I just described multiplies the cost of the storage by more than two, because it requires a redundant server and a load balancer (at which point you are likely to need a server rack in a server room).

It is therefore no surprise that many users, especially small businesses, turn to professional storage vendors, who offer cloud storage for a fee and take care of keeping the storage systems perpetually available. Professional storage vendors might also be very cost-effective. For example, cloud storage might cost you around $1,500 over four years, which is less than what you are likely to spend on a good NAS. Because a NAS is likely to need an upgrade around the fourth year anyway, the cloud option is not entirely unreasonable. Sadly, storage vendors come with their own issues: Uploading your data to them can take much longer than uploading it to a local server, and some vendor environments might present privacy concerns.

Humans and Software

Even if you were to assume that your chosen storage solution is completely indestructible, it would still not eliminate the need for a proper backup system. If you delete a file by mistake, or if you lose it to a software bug or malware, it makes no difference whether it was stored on a regular laptop, a high-end NAS, or a cloud storage provider. Experience shows that human mistakes force you to restore from backups much more often than hardware failures do. Certain storage vendors know this and keep a version history of every file uploaded to them, so you can retrieve an old copy of a file if you discover you have uploaded a corrupt version or deleted something important by accident. In effect, the vendor is running a backup policy for you.

Conclusion

A high-availability system is designed to serve its users even when issues such as hardware failure or power loss affect it. A side effect of a high-availability setup is that information that would have been lost to a failure in a non-redundant system may survive if it is managed by a storage cluster or even a high-end domestic NAS.

On the other hand, high-end storage systems can only protect your data so much. As shown in this article, solutions designed to keep a storage system running in the face of adversity might fail to guarantee the integrity of the data. After all, their primary concern is to maintain the continuity of the service, not to protect the information stored inside.

For this reason, it is advisable to maintain proper backups of your data, even if you keep it on a NAS server that looks impervious to the typical threats to data integrity. Quality storage decreases the probability of suffering data loss, but it does not remove it.

Infos

  1. Schroeder, B., E. Pinheiro, and W. Weber. "DRAM Errors in the Wild: A Large-Scale Field Study." In: Proceedings of SIGMETRICS '09, (SIGMETRICS, 2009), http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
  2. "A Complete Guide to FreeNAS Hardware Design, Part I: Purpose and Best Practices" by Joshua Paetzel, February 3, 2015, https://web.archive.org/web/20151122065016/http://www.freenas.org/whats-new/2015/02/a-complete-guide-to-freenas-hardware-design-part-i-purpose-and-best-practices.html
  3. ZFS and ECC RAM: https://arstechnica.com/civis/viewtopic.php?f=2&t=1235679&p=26303271#p26303271
  4. Common hardware malfunctions: https://blog.storagecraft.com/hardware-failure
  5. Unrecoverable errors in RAID 5: http://raidtips.com/raid5-ure.aspx
  6. Backblaze drive stats for Q1 2021: https://www.backblaze.com/blog/backblaze-hard-drive-stats-q1-2021/
  7. QTS and QuTS hero vulnerability: https://www.qnap.com/en/security-advisory/qsa-21-57
  8. DeadBolt: https://www.qnap.com/en/security-advisory/QSA-22-02
  9. ZFS and power failures: https://www.klennet.com/notes/2021-04-26-zfs-and-power-failures.aspx
  10. Cost of data center outages: https://www.ponemon.org/research/ponemon-library/security/2016-cost-of-data-center-outages.html
  11. Ceph: https://ceph.io/en/

The Author

Rubén Llorente is a mechanical engineer who ensures that the IT security measures for a small clinic are both legally compliant and safe. In addition, he is an OpenBSD enthusiast and a weapons collector.