A modern diff utility
Command Line – diffoscope

Lead Image © Vlad Kochelaevskiy, 123RF.com
With support for more than 60 file formats, diffoscope extends the power of diff beyond the plain text or HTML file.
The first command in Unix-like systems for comparing files and directories was diff. Originally written by Douglas McIlroy and first appearing in Unix 5th Edition in 1974, diff rapidly became an essential programming tool. Today, the original command is still available, and most programming languages have their own versions of diff. However, diff and its derivatives generally have one limitation: With few exceptions, they work only with plain text or markup languages like HTML. A new variation called diffoscope [1], which was released in mid-2020, brings a new level of functionality to file comparison.
Diffoscope is developed primarily by Debian's Reproducible Builds project [2], which aims to increase the robustness and security of Debian packages by ensuring that they always build the same way. Given Debian's nearly 60,000 packages and the variety of hardware available, this is no small task, especially considering that small errors in code can be hard to trace. Diffoscope was written to make this task easier by quickly tracking down differences between two files that are supposed to be identical but perform differently. As a side effect, diffoscope provides a modern diff utility that works across most programming languages and brings the power of diff to desktop users and non-programmers, especially writers who wish to compare drafts. Already, diffoscope supports over 60 file formats that range from archives and filesystem images to audio and text files, including MS Word, LibreOffice Writer, and PDF (Table 1). More seem likely to follow.
Table 1
Supported Formats
Android APK files | LLVM IR bitcode files
Android boot images | LZ4 compressed files
ar(1) archives | macOS binaries
Berkeley DB database files | Microsoft Windows icon files
bzip2 archives | Microsoft Word .docx files
Character/block devices | Mono Portable Executable files
ColorSync color profiles (.icc) | Multimedia metadata
coreboot CBFS filesystem images | OCaml interface files
cpio archives | Ogg Vorbis audio files
Dalvik .dex files | OpenOffice/LibreOffice .odt files
Directories | OpenSSH public keys
Debian buildinfo files | OpenWRT package archives (.ipk)
Debian .changes files | PDF documents
Debian source packages (.dsc) | PGP signatures
Device Tree Compiler blob files | PGP signed/encrypted messages
ELF binaries | PNG images
ext2/ext3/ext4/Btrfs/FAT filesystems | PostScript documents
freedesktop.org fontconfig cache files | RPM archives
Free Pascal files (.ppu) | Rust object files (.deflate)
gettext message catalogs | SQLite databases
GHC Haskell .hi files | SquashFS filesystems
GIF image files | Statically linked binaries
Git repositories | Symlinks
GNU R database files (.rdb) | Tape archives (.tar)
GNU R Rscript files (.rds) | tcpdump capture files (.pcap)
Gnumeric spreadsheets | Text files
Gzipped files | TrueType font files
ISO 9660 CD images | WebAssembly binary module
Java .class files | XML binary schemas (.xsb)
JavaScript files | XML files
JPEG images | XZ compressed files
JSON files |
Diffoscope's basic command structure is:
diffoscope FILE1 FILE2
If only one path is given, diffoscope does not run a comparison; instead, it tries to read a previously saved diffoscope diff from that path and render it in the output formats you request, which will only occasionally be useful. For convenience, the output can be piped through less or more. You might also add the --progress option for large files like DVD images. If you are dealing with large files, you might also run up against the built-in limits for output. Rather than resetting them one by one, you can cancel all of them with the option --no-default-limits.
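If you want to call diffoscope from a script rather than type the command each time, the basic structure above can be wrapped in a few lines of Python. The following is only a minimal sketch: the function names build_command and run_comparison are made up for illustration, and the only diffoscope options it uses are the two mentioned above.

```python
import shutil
import subprocess


def build_command(first, second, progress=False, no_default_limits=False):
    """Return the argument list for comparing two paths with diffoscope."""
    argv = ["diffoscope"]
    if progress:
        argv.append("--progress")
    if no_default_limits:
        argv.append("--no-default-limits")
    argv += [str(first), str(second)]
    return argv


def run_comparison(first, second, **options):
    """Run diffoscope if it is installed and return its report, else None."""
    if shutil.which("diffoscope") is None:
        return None  # diffoscope is not on PATH
    result = subprocess.run(
        build_command(first, second, **options),
        capture_output=True,
        text=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Hypothetical file names, used only to show the resulting command line.
    print(build_command("old.iso", "new.iso", progress=True, no_default_limits=True))
```

Keeping the command as a list rather than a single string avoids shell quoting problems when file names contain spaces.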
Output is to standard output by default, but you can also save it to a file. The output shows the content of the first file in red text, with each line prefaced by a minus sign, and the content of the second file in white text prefaced by a plus sign. At the top of the output, you'll find statistics that vary with the file type. For example, in Figure 1, the files share LibreOffice's .odt format, and the statistics are the file names, the amount of text in each file that differs, and the number of total words in each file. By contrast, in Figure 2, a directory diff is prefaced by file listings, file permissions, and other attributes. The output is driven by context, ensuring that it is useful for more than the diff itself.
Output Formatting Options
Besides standard output, diffoscope's report can be saved in several file formats. To write output to a text file, add the option --text OUTPUT-FILE, giving the full path. You can also color-code text output with --text-color WHEN, replacing WHEN with never, auto, or always. Color is enabled automatically in standard output, but disabled by default when you write to a file. Similarly, an HTML file is named with --html OUTPUT-FILE. Color is not supported for HTML files, but you can write a multi-file HTML report using --html-dir OUTPUT-DIRECTORY, so you can absorb the output in small chunks, and --css URL to format the output as desired. HTML output can also be pointed at a copy of jQuery with --jquery URL. Other supported file format options are --json OUTPUT-FILE, --markdown OUTPUT-FILE, and --restructured-text OUTPUT-FILE, all three of which can be used for either files or for standard output. In all these formats, --output-empty can be used to write a report even when no differences are found.
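Because the report options can be combined, one comparison can produce several reports at once. As a rough sketch, the helper below assembles those options into an argument list; the name report_options and its parameters are invented for this example, and it covers only --text, --html, --json, --markdown, --text-color, and --output-empty.

```python
VALID_COLOR_MODES = ("never", "auto", "always")


def report_options(text=None, html=None, json_file=None, markdown=None,
                   text_color=None, output_empty=False):
    """Build the report-related part of a diffoscope command line."""
    argv = []
    targets = (
        ("--text", text),
        ("--html", html),
        ("--json", json_file),
        ("--markdown", markdown),
    )
    for flag, target in targets:
        if target is not None:
            argv += [flag, str(target)]
    if text_color is not None:
        # Reject anything other than the three documented values.
        if text_color not in VALID_COLOR_MODES:
            raise ValueError(f"text_color must be one of {VALID_COLOR_MODES}")
        argv += ["--text-color", text_color]
    if output_empty:
        argv.append("--output-empty")
    return argv


if __name__ == "__main__":
    # Hypothetical paths; the two drafts are placeholders.
    options = report_options(text="/tmp/report.txt", html="/tmp/report.html",
                             text_color="always")
    print(["diffoscope", *options, "draft1.odt", "draft2.odt"])
```

The resulting list can be handed to subprocess.run() unchanged.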
Output Limit Options
Coming from an era of memory limitations, diff is economical, by default writing just a few lines so that the context of a difference can be read. By contrast, diffoscope, written in mid-2020, has limits that are so high that, for all practical purposes, it often has no limits. If you want to limit diffoscope's output, perhaps to make it more manageable, you have to add limits deliberately. The number of bytes in a text report is unlimited by default, but you can use --max-text-report-size BYTES to define a limit. Alternatively, you can use --max-page-size BYTES to change the default of 409,600 for an HTML page, or, if using --html-dir OUTPUT-DIRECTORY, you can use --max-page-size-child BYTES to change the size of the separate pages of an HTML report from the default of 204,800. Still another alternative is to use --max-diff-block-lines LINES to change the default of 1,024 lines for a unified-diff block, that is, for separate chunks of the report. These options are primarily for comparisons of long files, such as .iso images, and are generally irrelevant when dealing with files in MS Word or LibreOffice format unless you are comparing complete manuscripts.
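To see what the economy of the classic unified-diff format looks like in practice, the short sketch below uses difflib from Python's standard library as a stand-in. It is not diffoscope's code, and the file names and the 20-line sample are invented, but it shows how one changed line is reported with only a few lines of surrounding context rather than the whole file.

```python
import difflib

# Twenty numbered lines, with a single change on line 10.
old = [f"line {i}\n" for i in range(1, 21)]
new = list(old)
new[9] = "line 10 (edited)\n"


def unified(a, b, context):
    """Return a unified diff as a list of lines with the given context size."""
    return list(
        difflib.unified_diff(a, b, fromfile="old.txt", tofile="new.txt", n=context)
    )


short = unified(old, new, context=3)   # three lines of context on each side
bare = unified(old, new, context=0)    # the changed lines only

if __name__ == "__main__":
    print("".join(short))
    print(f"{len(short)} lines with context, {len(bare)} without")
```

Each "@@" header in the result starts one block of the kind that a per-block line limit applies to.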
Difference Calculation Options
A number of options modify how diffoscope makes its comparisons. Despite their similar names, --exclude GLOB_PATTERN and --exclude-command REGEX_PATTERN do different jobs: The first skips files whose names match a glob pattern and can be used with either files or directories, while the second skips external commands that match a regular expression. When working with directories, you can set whether permissions and other file attributes are used with --exclude-directory-metadata SETTING, which can be completed with auto, yes, no, or recursive. In addition, you can adjust fuzzy matching with --fuzzy-threshold, controlling how minor differences are handled. A setting of 0 means that all matches must be exact; however, the meaning of the default of 60 or the maximum of 400 has to be discovered through trial and error, since it is currently undocumented.
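Glob patterns and regular expressions are easy to confuse, so it helps to see how differently they match. The sketch below uses Python's fnmatch and re modules purely as an illustration: the file names, command strings, and helper names are made up, and the exact anchoring rules diffoscope applies to its own patterns are an assumption here, not something taken from its documentation.

```python
import fnmatch
import re

# Hypothetical file names, as a glob pattern would see them.
file_names = ["notes.txt", "notes.txt.bak", "image.png"]

# Hypothetical external commands, as a regular expression would see them.
commands = [
    "readelf --wide --file-header",
    "objdump --disassemble",
    "readelf --debug-dump=info",
]


def glob_matches(pattern, candidates):
    """Return the candidates that a shell-style glob matches in full."""
    return [c for c in candidates if fnmatch.fnmatchcase(c, pattern)]


def regex_matches(pattern, candidates):
    """Return the candidates in which a regular expression is found."""
    compiled = re.compile(pattern)
    return [c for c in candidates if compiled.search(c)]


if __name__ == "__main__":
    print(glob_matches("*.txt", file_names))
    print(regex_matches(r"^readelf.*--debug-dump", commands))
```

A glob must match the whole name, whereas a regular expression matches anywhere in the string unless you anchor it with ^ or $.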
Other options are reminiscent of diff itself, setting the number of lines to compare. Use --max-diff-input-lines LINES to cap the number of lines fed to diff (the default is 4,194,304). You can also set the maximum number of lines saved per diff block with --max-diff-block-lines-saved LINES.