Avoiding data corruption in backups
Integrity Check
A backup policy can protect your data from malware attacks and system crashes, but first you need to ensure that you are backing up uncorrupted data.
Most home users, and I dare say some system administrators, lack a backup policy. Their family pictures, music collections, and customer data files live on their hard drives and are never backed up to offline storage, only to be lost when the hard drive eventually crashes. A few users know to keep backups and regularly copy their files over to a safe storage medium. But even these conscientious people may find their strategy lacking when the time comes to recover from a system crash and they discover corrupted backup data. A successful backup strategy must involve checking for corrupted data.
Silent Data Corruption
The small number of users who do keep copies of their important files often keep only a single backup. Typically, they use an external storage service, such as Tarsnap or a Nextcloud instance, periodically or continuously synchronizing the important files on their computers with the cloud. While this approach is comfortable for end users, a single backup suffers from a number of problems. Most importantly, single backups are vulnerable to silent data corruption.
Take for example a folder called Foals, which is full of pictures of happy young horses. My backup strategy consists of weekly copying the entire folder over to USB mass storage with a tool such as rsync [1]:

$ rsync -a --delete --checksum Foals/ /path/to/usb/

The rsync tool synchronizes the contents of /path/to/usb with the contents of Foals.
This strategy works until one of the pictures in Foals gets corrupted. Files get damaged for a number of reasons, such as a filesystem failing to recover properly after an unclean shutdown. Files also may be lost because of human error: You intend to delete Foals/10_foal.jpg but end up removing Foals/01_foal.jpg instead without realizing the mistake. If a file gets corrupted or lost and you don't detect the issue before the next backup cycle, rsync will overwrite the good copy in USB storage with bad data. At this point, all the good copies of the data cease to exist, destroyed by the very backup system intended to protect them.
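One way to reduce this risk is to preview what rsync would do before letting it touch the backup. A minimal sketch, using rsync's --dry-run and --itemize-changes options with the same paths as above:

$ rsync -a --delete --checksum --dry-run --itemize-changes Foals/ /path/to/usb/

Any unexpected deletion or content change in the listing is a cue to investigate before running the real backup.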
To mitigate this threat, you can establish a long-term storage policy for backups, which involves saving your backup to a different folder each week within the USB mass storage. I could therefore keep a current backup of Foals in a folder called Foals_2022-01-30, an older backup in Foals_2022-01-23, and so on. When the backup storage becomes full, I could just delete the older folders to make room for the newer ones. With this strategy, if data corruption happens and it takes me a week to discover it, I may be able to dig up good copies of the files from an older snapshot (Figure 1). See the boxout "The rsync Time Machine" for instructions on how to set up this multi-week backup system.
The rsync Time Machine
With rsync, you can save backups to a directly attached drive or over a network. As an added convenience, the snapshots that rsync takes do not take up much space on your storage device.
Suppose I have an external drive mounted under /mnt. The first snapshot would be saved with a regular invocation of rsync:

$ mkdir /mnt/Foals_2022-01-23
$ rsync -a Foals/ /mnt/Foals_2022-01-23
The first command creates a directory with a name reflecting the date. The second command copies Foals to the newly created directory. The -a switch instructs rsync to work in "archival" mode, recursively descending into subdirectories and preserving symlinks, time metadata, file permissions, and file ownership data.
When the time comes to make another weekly backup, I create a different backup folder (which references the new current date) and copy Foals to it. However, rsync has a trick up its sleeve: The --link-dest switch tells rsync to transfer only the changes since the last backup:

$ mkdir /mnt/Foals_2022-01-30
$ rsync -a --link-dest /mnt/Foals_2022-01-23 Foals/ /mnt/Foals_2022-01-30
As a result, rsync copies any new file to the new backup directory, alongside any file that has been modified since the last backup. Files that have been deleted from the source directory are not copied. For files that exist in the source directory but have not been modified since the last backup, rsync creates hard links to their copies in the old backup directory rather than copying them to the new backup directory.
The end result is that Foals_2022-01-23 contains a copy of Foals as it was on that date, while Foals_2022-01-30 contains a current snapshot of Foals. Because only modified or new files are added to the storage medium, they barely take up any extra space. Everything else is included in the new backup folder via hard links.
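You can confirm that the space savings are real by comparing inode numbers: An unchanged file should show up in both snapshots as the same inode. A quick check, using the snapshot names from this example:

$ ls -i /mnt/Foals_2022-01-23/01_foal.jpg /mnt/Foals_2022-01-30/01_foal.jpg
$ find /mnt/Foals_2022-01-30 -type f -links +1 | wc -l

If the two paths report the same inode number, they are hard links to a single copy on disk; the second command counts how many files in the new snapshot are shared with older snapshots.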
Unfortunately, long-term storage only works if the data corruption is discovered in time. If your storage medium only has room for four snapshots, a particular version of a file will only exist in the backup for four weeks. On the fifth week, the oldest snapshot will be deleted in order to make room for new copies. If the data corruption is not detected within this time window, the good copies of the data will be gone, and you will no longer be able to retrieve them from a backup.
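Rotating old snapshots out can be scripted. The following is a minimal pruning sketch, assuming the snapshots live under /mnt and follow the Foals_YYYY-MM-DD naming used above; it keeps the four newest snapshots and removes the rest (the ISO dates make lexicographic order match chronological order):

$ ls -d /mnt/Foals_* | sort | head -n -4 | xargs -r rm -rf

Double-check the list that ls and head produce before piping it into rm.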
Solving for Silent Data Corruption
The first step in guaranteeing a good backup is to verify that you are backing up only uncorrupted data, which is easier said than done. Fortunately, a number of tools exist to help you preserve your data integrity.
Filesystems with checksum support (such as ZFS) offer a reasonable degree of protection against corruption derived from hardware errors. A checksum function takes data, such as a message or a file, and generates a string of text from it. As long as the function is passed the same data, it will generate the same string. If the data gets corrupted in the slightest, the generated string will be different.
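You can see this behavior with the md5sum tool from coreutils. A quick demonstration on a throwaway file (the file name is just an example):

$ echo "hello" > /tmp/sample.txt
$ md5sum /tmp/sample.txt
$ echo "hello!" > /tmp/sample.txt
$ md5sum /tmp/sample.txt

The second md5sum call prints a completely different checksum, even though only a single character changed.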
ZFS [2], in particular, can verify if a data block is correct upon reading it. If it is not (e.g., as a result of a hard drive defect), ZFS either repairs the data block or throws an error for the user to see.
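You can also ask ZFS to walk an entire pool and verify every block on demand. A brief sketch, assuming a pool named tank (the pool name is hypothetical):

$ sudo zpool scrub tank
$ zpool status tank

The zpool status output reports any checksum errors the scrub found and whether ZFS was able to repair them.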
However, ZFS cannot protect data against human error: If you delete a file by accident with rm Foals/01_foal.jpg, ZFS has no way of knowing this is a mistake instead of a legitimate operation. If a bogus image editor accidentally damages the picture using valid system calls, ZFS cannot differentiate changes caused by software bugs from changes intended by the user. While ZFS is often praised as the ultimate guarantee for data integrity, its impressive capabilities fall short here, in my opinion.
Protection from Userspace
To verify that the data being backed up is correct, I suggest relying on userspace utilities. While many userspace programs are superb at locating damaged files, not all of them are easily executable from an arbitrary recovery environment. In a system crash scenario, you may find yourself using something like an obsolete SystemRescue DVD (perhaps from an old Linux Magazine) instead of your normal platform. In keeping with the KISS principle, you should choose userspace tools that are portable and easy to use from any platform.
If your distribution includes the GNU coreutils package (which the vast majority do), you need no fancy tooling.
Ideally, you should verify the files' integrity immediately before the backup is performed. The simplest way of ensuring a given file has not been modified, accidentally or otherwise, is to calculate its checksum and compare the result with the checksum obtained from a known good state (Figure 2). Thus, the first step towards protecting a given folder against corruption is calculating the checksum of every file in the folder:

$ cd Foals
$ find . -type f ! -name '*.md5' -print0 | xargs -0 md5sum | sort -k 2 > md5sums_`date -I`.md5
(See the "Creating a Checksum" box for a more detailed explanation.)
Creating a Checksum
The checksum command above is not intuitive, so I will break it down and explain how it works its magic.
The find command locates any file (but not directories) in the current folder, excluding files with the .md5 extension. It prints a list of the found files to the standard output. The path of each file is null terminated in order to avoid security issues (which could be derived from piping paths with special characters into the next command):

find . -type f ! -name '*.md5' -print0
Then xargs accepts the list provided by the find command and passes it to the md5sum program, which generates a checksum for every entry in the list. The -0 switch tells xargs that find is passing null-terminated paths to it:

xargs -0 md5sum
The sort command orders the list (because find is not guaranteed to deliver sorted results). The output of md5sum has two columns: The second column contains the path of each file; the first contains its corresponding checksum. Therefore, I pass the -k 2 switch to sort in order to sort the list using the path names as the sort key:

sort -k 2
These commands create a list of all the files in the Foals directory, alongside their MD5 checksums, and place it under Foals. The file will have a name dependent on the current date (such as md5sums_2022-01-23.md5).
If a week later I want to verify that the files are fine, I can issue the same command to generate a new list. Then, it is easy to check the differences between the state of the Foals folder on the previous date and its state on the current date with the following command:

$ diff md5sums_2022-01-23.md5 md5sums_2022-01-30.md5
The diff command generates a list of differences between the two files, which makes it easy to spot which files have been changed, added, or removed from Foals (Figure 3). If a file has been damaged, this command will expose the difference.
Using diff is only practical if the dataset is small. If you are backing up a large number of files, there are better ways to check that your data is not corrupted. For instance, you can use grep to list the entries that exist in the old checksum file but not in the new one. In other words: grep will list the files that have been modified or removed since the last time you performed a check:

$ grep -Fvf md5sums_2022-01-30.md5 md5sums_2022-01-23.md5
The -f md5sums_2022-01-30.md5 option instructs grep to treat every line of md5sums_2022-01-30.md5 as a target pattern. Any line in md5sums_2022-01-23.md5 that coincides with any of these patterns will be regarded as a match. The -F option forces grep to consider the patterns as fixed strings instead of regular expressions, so for a match to be registered, it must be exact. Finally, -v inverts the matching: Only lines from md5sums_2022-01-23.md5 that match no pattern will be printed.
You can also list the files that have been added since the check was last run with the shell magic in Listing 1.
Listing 1
Newly Added Files
awk '{print $2}' < md5sums_2022-01-30.md5 | while read -r file; do
  if ! grep -qF -- "$file" md5sums_2022-01-23.md5; then
    echo "$file is new."
  fi
done
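Both comparisons can also be made in a single pass with comm, another coreutils tool. A sketch that re-sorts the lists so that comm can line them up:

$ comm -23 <(sort md5sums_2022-01-23.md5) <(sort md5sums_2022-01-30.md5)
$ comm -13 <(sort md5sums_2022-01-23.md5) <(sort md5sums_2022-01-30.md5)

The first command prints lines that appear only in the old list (files modified or removed); the second prints lines that appear only in the new list (files modified or added).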
With these tools, an integrity verification policy falls into place. In order to ensure you don't populate your backups with corrupted files, you must do the following (a sketch combining the steps appears after the list):
- Generate a list of the files in the dataset and their checksums before initiating the backup.
- Verify this list against the list you generated at the last known good state.
- Identify which changes have happened between the last known good state and the current state, and check if they suggest data corruption.
- If the data is good, back up your files.
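Put together, a weekly run of this policy for the Foals example might look like the following sketch, which reuses the commands shown earlier (only run the final two commands after reviewing the diff output):

$ cd Foals
$ find . -type f ! -name '*.md5' -print0 | xargs -0 md5sum | sort -k 2 > md5sums_`date -I`.md5
$ diff md5sums_2022-01-23.md5 md5sums_`date -I`.md5
$ cd ..
$ mkdir /mnt/Foals_`date -I`
$ rsync -a --link-dest /mnt/Foals_2022-01-23 Foals/ /mnt/Foals_`date -I`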
A great advantage of this method is that the checksum files can be used to verify the integrity of the backups themselves. For example, if you dumped the backup to /mnt/Foals_2022-01-23, you could just use a command such as:

$ cd /mnt/Foals_2022-01-23
$ md5sum --quiet -c md5sums_2022-01-23.md5

If any file was missing from the backup or had been modified, this command would reveal the issue right away.