A modern diff utility
Command Line – diffoscope
With support for more than 60 file formats, diffoscope extends the power of diff beyond the plain text or HTML file.
The first command in Unix-like systems for comparing files and directories was diff. Originally written by Douglas McIlroy and first appearing in Unix 5th Edition in 1974, diff rapidly became an essential programming tool. Today, the original command is still available, and most programming languages have their own versions of diff. However, diff and its derivatives generally have one limitation: With few exceptions, most of them work only with plain text or markup languages like HTML. A new variation called diffoscope [1], which was released in mid-2020, brings a new level of functionality to file comparison.
Diffoscope is developed primarily by Debian's Reproducible Builds project [2], which aims to increase the robustness and security of Debian packages by ensuring that they always build the same way. Given Debian's nearly 60,000 packages and the variety of hardware available, this is no small task, especially considering that small errors in code can be hard to trace. Diffoscope was written to make this task easier by quickly tracking down differences between two files that are supposed to be identical but perform differently. As a side effect, diffoscope provides a modern diff utility that works across most programming languages and brings the power of diff to desktop users and non-programmers, especially writers who wish to compare drafts. Already, diffoscope supports over 60 file formats that range from archives and filesystems to audio and text files, including MS Word, LibreOffice Writer, and PDF (Table 1). More seem likely to follow.
Table 1
Supported Formats
Android APK files | LLVM IR bitcode files |
Android boot images | LZ4 compressed files |
ar(1) archives | macOS binaries |
Berkeley DB database files | Microsoft Windows icon files |
bzip2 archives | Microsoft Word .docx files |
Character/block devices | Mono Portable Executable files |
ColorSync color profiles (.icc) | Multimedia metadata |
coreboot CBFS filesystem images | OCaml interface files |
cpio archives | Ogg Vorbis audio files |
Dalvik .dex files | OpenOffice/LibreOffice .odt files |
Directories | OpenSSH public keys |
Debian buildinfo files | OpenWRT package archives (.ipk) |
Debian .changes files | PDF documents |
Debian source packages (.dsc) | PGP signatures |
Device Tree Compiler blob files | PGP signed/encrypted messages |
ELF binaries | PNG images |
ext2/ext3/ext4/Btrfs/FAT filesystems | PostScript documents |
freedesktop.org fontconfig cache files | RPM archives |
Free Pascal files (.ppu) | Rust object files (.deflate) |
gettext message catalogs | SQLite databases |
GHC Haskell .hi files | SquashFS filesystems |
GIF image files | Statically linked binaries |
Git repositories | Symlinks |
GNU R database files (.rdb) | Tape archives (.tar) |
GNU R Rscript files (.rds) | tcpdump capture files (.pcap) |
Gnumeric spreadsheets | Text files |
Gzipped files | TrueType font files |
ISO 9660 CD images | WebAssembly binary module |
Java .class files | XML binary schemas (.xsb) |
JavaScript files | XML files |
JPEG images | XZ compressed files |
JSON files | |
Diffoscope's basic command structure is:
diffoscope FILE1 FILE2
If only one path is given, diffoscope does not run a comparison. According to its help text, it instead tries to read a previously saved diffoscope diff from that path and render it in whatever output formats you request. For convenience, the command can be piped through less or more. You might also add the --progress option for large files like DVD images. If you are dealing with large files, you might also run up against the built-in limits for output. Rather than resetting them one by one, you can cancel all of them with the option --no-default-limits.
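For scripted use, such as a build check, the invocation above can be assembled in a few lines of Python. The following is a minimal sketch rather than an official interface: The helper name build_command and the file names are hypothetical, and only the --progress and --no-default-limits options described above are used.

```python
import shlex


def build_command(first, second, progress=False, no_default_limits=False):
    """Return the argument list for a two-path diffoscope comparison."""
    cmd = ["diffoscope"]
    if progress:
        cmd.append("--progress")           # show progress on large inputs
    if no_default_limits:
        cmd.append("--no-default-limits")  # cancel all built-in output limits
    cmd += [first, second]
    return cmd


# Hypothetical file names, used only for illustration.
cmd = build_command("first.iso", "second.iso",
                    progress=True, no_default_limits=True)

# Print the command in a form that can be pasted into a shell.
print(shlex.join(cmd))
```

On a system where diffoscope is installed, the resulting list can be passed to subprocess.run().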
Output is to standard output by default, but you can also save to file. The output shows the content of the first file in red text, with each line prefaced by a minus sign, and the content of the second file in white text prefaced by a plus sign. At the top of the output, you'll find statistics that vary with the file type. For example, in Figure 1, the files share LibreOffice's .odt format, and the statistics are the file names, the amount of text in each file that differs, and the number of total words in each file. By contrast, in Figure 2, a directory diff is prefaced by file listings, file permissions, and other attributes. The output is driven by context, ensuring that it is useful for more than the diff itself.
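The minus and plus prefixes follow the unified diff convention. As an illustration only, the sketch below uses difflib from Python's standard library, not diffoscope itself, to produce the same kind of marked-up lines. The draft file names and sentences are made up.

```python
import difflib

# Two hypothetical drafts that differ in a single line.
draft1 = ["The quick brown fox\n", "jumps over the lazy dog\n"]
draft2 = ["The quick brown fox\n", "leaps over the lazy dog\n"]

# unified_diff yields header lines, then context lines (leading space),
# removed lines (leading minus), and added lines (leading plus).
diff = list(difflib.unified_diff(draft1, draft2,
                                 fromfile="draft1.txt", tofile="draft2.txt"))

print("".join(diff), end="")
```

Lines from the first draft that were changed carry the minus sign, and their replacements from the second draft carry the plus sign, which is how diffoscope's text output should be read as well.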
Output Formatting Options
Besides standard output, diffoscope's output can be saved to several file formats. To write output to a text file, add the option --text OUTPUT-FILE, giving the full path. You can also color-code text output with --text-color WHEN, replacing WHEN with never, auto, or always. Color is enabled automatically in standard output, but disabled by default when you write to a file. Similarly, an HTML file is named with --html OUTPUT-FILE. Color is not supported for HTML files, but you can write a multi-file HTML report using --html-dir OUTPUT-DIRECTORY, so you can absorb the output in small chunks, and --css URL to format the output as desired. If you are using JavaScript, HTML output can load jQuery from the location given with --jquery URL. Other supported file format options are --json OUTPUT-FILE, --markdown OUTPUT-FILE, and --restructured-text OUTPUT-FILE, all three of which can be used for either files or for standard output. In all these formats, --output-empty can be used to write a report even when no differences are found.
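When several reports are wanted from one run, the output options can be collected programmatically. The sketch below is an assumption-laden illustration: The helper report_arguments and the destinations are hypothetical, it assumes that more than one output option may be given at once, and it assumes that a destination of "-" stands for standard output.

```python
# Map a report format to the diffoscope option that selects it.
FORMAT_OPTIONS = {
    "text": "--text",
    "html": "--html",
    "json": "--json",
    "markdown": "--markdown",
    "rst": "--restructured-text",
}


def report_arguments(reports, text_color=None, output_empty=False):
    """Turn {format: destination} into diffoscope output arguments.

    A destination of "-" is assumed to mean standard output.
    """
    args = []
    for fmt, destination in reports.items():
        args += [FORMAT_OPTIONS[fmt], destination]
    if text_color is not None:
        if text_color not in ("never", "auto", "always"):
            raise ValueError(f"unsupported --text-color value: {text_color}")
        args += ["--text-color", text_color]
    if output_empty:
        args.append("--output-empty")  # write a report even without differences
    return args


# Hypothetical destinations: a text file, an HTML file, and JSON on stdout.
args = report_arguments(
    {"text": "/tmp/report.txt", "html": "/tmp/report.html", "json": "-"},
    text_color="always",
    output_empty=True,
)
print(args)
```

These arguments would be placed between the diffoscope command name and the two paths being compared.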
Output Limit Options
Coming from an era of memory limitations, diff is economical, by default writing just a few lines so that the context of a difference can be read. By contrast, diffoscope, written in mid-2020, has limits that are so high that, for all practical purposes, it often has no limits. Instead, if you want to limit diffoscope's output, perhaps to make it more manageable, you have to add limits deliberately. The number of bytes in a text report is unlimited by default, but you can use --max-text-report-size BYTES to define a limit. For HTML reports, you can use --max-page-size BYTES to change the default of 409,600 for the main page, or, if using --html-dir OUTPUT-DIRECTORY, you can use --max-page-size-child BYTES to change the size of the separate child pages from the default of 204,800. Still another alternative is to use --max-diff-block-lines LINES to change the default of 1,024 lines for a unified-diff block (that is, for separate chunks of the report). These options are primarily for comparisons of long files, such as .iso images, and are generally irrelevant when dealing with files in MS Word or LibreOffice format unless you are comparing complete manuscripts.
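Because the size options take raw byte counts, it can help to see that the quoted defaults are round numbers in kibibytes. The following sketch does the arithmetic and builds matching arguments. The helper names kib and limit_arguments and the chosen sizes are hypothetical.

```python
KIB = 1024  # bytes per kibibyte


def kib(n):
    """Convert kibibytes to the byte count that the size options expect."""
    return n * KIB


# The two defaults quoted in the text, expressed in KiB.
assert kib(400) == 409_600
assert kib(200) == 204_800


def limit_arguments(text_report_kib=None, child_page_kib=None):
    """Build size-limit arguments from values given in KiB."""
    args = []
    if text_report_kib is not None:
        args += ["--max-text-report-size", str(kib(text_report_kib))]
    if child_page_kib is not None:
        args += ["--max-page-size-child", str(kib(child_page_kib))]
    return args


# Cap the text report at 64 KiB and each child page at 100 KiB.
limits = limit_arguments(text_report_kib=64, child_page_kib=100)
print(limits)
```

Working in KiB and converting at the last moment keeps the intent readable in a script, while diffoscope still receives the byte values it expects.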
Difference Calculation Options
A number of options modify how diffoscope makes its comparisons. --exclude GLOB_PATTERN and --exclude-command REGEX_PATTERN look similar but do different jobs: The first skips files whose names match a glob pattern, whereas the second skips external commands whose command line matches a regular expression. When working with directories, you can set whether permissions and other file attributes are ignored with --exclude-directory-metadata SETTING, which can be completed with auto, yes, no, or recursive. In addition, you can adjust fuzzy matching with --fuzzy-threshold, controlling how minor differences are handled. A setting of 0 means that all matches must be exact; however, the meaning of the default of 60 or the maximum of 400 has to be discovered through trial and error, since it is currently undocumented.
Other options are reminiscent of diff itself, setting the number of lines to compare. Use --max-diff-input-lines LINES to cap the number of lines passed to diff (the default is 4,194,304). You can also set the maximum number of lines saved per diff block with --max-diff-block-lines-saved LINES.
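The comparison options lend themselves to a little validation before diffoscope is called. The sketch below is illustrative only: The helper comparison_arguments is hypothetical, the fuzzy-matching option is assumed to be named --fuzzy-threshold, and --exclude is assumed to be repeatable.

```python
METADATA_SETTINGS = ("auto", "yes", "no", "recursive")
MAX_FUZZY_THRESHOLD = 400  # upper bound quoted in the text; 0 means exact

# The line figure quoted in the text is a power of two.
assert 4_194_304 == 4 * 1024 * 1024 == 2 ** 22


def comparison_arguments(exclude=(), metadata=None, fuzzy_threshold=None,
                         max_input_lines=None):
    """Validate settings and return the matching diffoscope arguments."""
    args = []
    for pattern in exclude:
        args += ["--exclude", pattern]  # assumed to be repeatable
    if metadata is not None:
        if metadata not in METADATA_SETTINGS:
            raise ValueError(f"unknown metadata setting: {metadata}")
        args += ["--exclude-directory-metadata", metadata]
    if fuzzy_threshold is not None:
        if not 0 <= fuzzy_threshold <= MAX_FUZZY_THRESHOLD:
            raise ValueError("fuzzy threshold must be between 0 and 400")
        args += ["--fuzzy-threshold", str(fuzzy_threshold)]
    if max_input_lines is not None:
        args += ["--max-diff-input-lines", str(max_input_lines)]
    return args


# Skip log files, compare directory metadata, and require exact matches.
options = comparison_arguments(exclude=["*.log"], metadata="no",
                               fuzzy_threshold=0)
print(options)
```

Rejecting an out-of-range threshold or a misspelled metadata setting in the script gives a clearer error than a failed diffoscope run halfway through a long comparison.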