A modern diff utility
Command Line – diffoscope

Lead Image © Vlad Kochelaevskiy, 123RF.com
With support for more than 60 file formats, diffoscope extends the power of diff beyond the plain text or HTML file.
The first command in Unix-like systems for comparing files and directories was diff. Originally written by Douglas McIlroy and first appearing in Unix 5th Edition in 1974, diff rapidly became an essential programming tool. Today, the original command is still available, and most programming languages have their own versions of diff. However, diff and its derivatives generally have one limitation: With few exceptions, they work only with plain text or markup languages like HTML. A new variation called diffoscope [1], which was released in mid-2020, brings a new level of functionality to file comparison.
Diffoscope is developed primarily by Debian's Reproducible Builds project [2], which aims to increase the robustness and security of Debian packages by ensuring that they always build the same way. Given Debian's nearly 60,000 packages and the variety of hardware available, this is no small task, especially considering that small errors in code can be hard to trace. Diffoscope was written to make this task easier by quickly tracking down differences between two files that are supposed to be identical but perform differently. As a side effect, diffoscope provides a modern diff utility that works across most programming languages and brings the power of diff to desktop users and non-programmers, especially writers who wish to compare drafts. Already, diffoscope supports over 60 binary formats that range from files and filesystems to audio and text files, including MS Word, LibreOffice Writer, and PDF (Table 1). More seem likely to follow.
Table 1
Supported Formats
Android APK files | LLVM IR bitcode files
Android boot images | LZ4 compressed files
ar(1) archives | macOS binaries
Berkeley DB database files | Microsoft Windows icon files
bzip2 archives | Microsoft Word .docx files
Character/block devices | Mono Portable Executable files
ColorSync color profiles (.icc) | Multimedia metadata
coreboot CBFS filesystem images | OCaml interface files
cpio archives | Ogg Vorbis audio files
Dalvik .dex files | OpenOffice/LibreOffice .odt files
Directories | OpenSSH public keys
Debian buildinfo files | OpenWRT package archives (.ipk)
Debian .changes files | PDF documents
Debian source packages (.dsc) | PGP signatures
Device Tree Compiler blob files | PGP signed/encrypted messages
ELF binaries | PNG images
ext2/ext3/ext4/Btrfs/FAT filesystems | PostScript documents
freedesktop.org fontconfig cache files | RPM archives
Free Pascal files (.ppu) | Rust object files (.deflate)
gettext message catalogs | SQLite databases
GHC Haskell .hi files | SquashFS filesystems
GIF image files | Statically linked binaries
Git repositories | Symlinks
GNU R database files (.rdb) | Tape archives (.tar)
GNU R Rscript files (.rds) | tcpdump capture files (.pcap)
Gnumeric spreadsheets | Text files
Gzipped files | TrueType font files
ISO 9660 CD images | WebAssembly binary module
Java .class files | XML binary schemas (.xsb)
JavaScript files | XML files
JPEG images | XZ compressed files
JSON files |
Diffoscope's basic command structure is:
diffoscope FILE1 FILE2
If only one file or directory is given, then diffoscope attempts to compare the given file with the last file compared, a desperate act that will only occasionally be useful. For convenience, the command can be piped through less or more. You might also add the --progress option for large files like DVD images. If you are dealing with large files, you might also run up against the built-in limits for output. Rather than resetting them, you can cancel all of them with the option --no-default-limits.
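As a rough illustration of the basic invocation, the following Python sketch assembles the command line described above and runs it only if diffoscope is found on the system. The helper name basic_command and the file names are hypothetical; the two flags are the ones mentioned in the text.

```python
import shutil
import subprocess


def basic_command(file1, file2, progress=False, no_limits=False):
    """Return the diffoscope argument list for comparing two paths."""
    cmd = ["diffoscope"]
    if progress:
        cmd.append("--progress")           # progress indicator for large inputs
    if no_limits:
        cmd.append("--no-default-limits")  # cancel all built-in output limits
    cmd += [file1, file2]
    return cmd


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    cmd = basic_command("old.iso", "new.iso", progress=True, no_limits=True)
    print(" ".join(cmd))
    if shutil.which("diffoscope"):
        # A non-zero exit status is expected when the inputs differ,
        # so check=True is deliberately not used here.
        subprocess.run(cmd)
```

Building the argument list separately from running it makes it easy to print or log the exact command before committing to a long comparison.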
Output is to standard output by default, but you can also save it to a file. The output shows the content of the first file in red text, with each line prefaced by a minus sign, and the content of the second file in white text prefaced by a plus sign. At the top of the output, you'll find statistics that vary with the file type. For example, in Figure 1, the files share LibreOffice's .odt format, and the statistics are the file names, the amount of text in each file that differs, and the number of total words in each file. By contrast, in Figure 2, a directory diff is prefaced by file listings, file permissions, and other attributes. The output is driven by context, ensuring that it is useful for more than the diff itself.
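The minus and plus prefixes follow the general unified-diff convention. As a small illustration that does not involve diffoscope itself, Python's standard difflib module produces the same kind of prefixes; the two sample drafts and file names below are made up.

```python
import difflib

# Two made-up drafts that differ in one line.
old = ["The quick brown fox\n", "jumps over the lazy dog\n"]
new = ["The quick brown fox\n", "leaps over the lazy dog\n"]

# Lines only in the first input get "-", lines only in the second get "+",
# and unchanged context lines get a single leading space.
diff = list(difflib.unified_diff(old, new,
                                 fromfile="draft1.txt", tofile="draft2.txt"))
print("".join(diff), end="")
```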
Output Formatting Options
Besides standard output, diffoscope's output can be saved to several file formats. To write output to a text file, add the option --text OUTPUT-FILE, giving the full path. You can also color-code an output text file with --text-color WHEN, replacing WHEN with never, auto, or always. Color is enabled automatically in standard output, but disabled by default when you write to a file. Similarly, an HTML file is named with --html OUTPUT-FILE. Color is not supported for HTML files, but you can write a multi-file HTML report using --html-dir OUTPUT-DIRECTORY, so you can absorb the output in small chunks, and --css URL to format the output as desired. If you are using JavaScript, HTML output can be formatted using --jquery URL. Other supported file format options are --json OUTPUT-FILE, --markdown OUTPUT-FILE, and --restructured-text OUTPUT-FILE, all three of which can be used for either files or for standard output. In all these formats, --output-empty can be used to write a file that reports no differences.
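To show how these output options combine on one command line, here is a minimal Python sketch that maps keyword arguments to the flags named above. The function output_options, its parameter names, and the report and draft file names are hypothetical; only the flags themselves come from the text.

```python
def output_options(text=None, text_color=None, html=None, json_file=None,
                   markdown=None, rst=None, output_empty=False):
    """Translate keyword arguments into diffoscope output flags."""
    opts = []
    if text is not None:
        opts += ["--text", text]
    if text_color is not None:
        if text_color not in ("never", "auto", "always"):
            raise ValueError("text_color must be never, auto, or always")
        opts += ["--text-color", text_color]
    if html is not None:
        opts += ["--html", html]
    if json_file is not None:
        opts += ["--json", json_file]
    if markdown is not None:
        opts += ["--markdown", markdown]
    if rst is not None:
        opts += ["--restructured-text", rst]
    if output_empty:
        opts.append("--output-empty")  # write a report even without differences
    return opts


# Hypothetical report and input names.
cmd = ["diffoscope"] + output_options(text="report.txt", text_color="always",
                                      html="report.html", output_empty=True)
cmd += ["draft1.odt", "draft2.odt"]
print(" ".join(cmd))
```

Several report formats can be requested in the same run, which saves repeating a slow comparison just to get a second format.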
Output Limit Options
Coming from an era of memory limitations, diff is economical, by default writing just a few lines so that the context of a difference can be read. By contrast, diffoscope, written in mid-2020, has limits that are so high that, for all practical purposes, it often has no limits. If you want to limit diffoscope's output, perhaps to make it more manageable, you have to add limits deliberately. The number of bytes in an output report is unlimited by default, but you can use --max-text-report-size BYTES to define a limit. Alternatively, you can use --max-page-size BYTES to change the default of 409,600 for the main HTML page, or, if using --html-dir OUTPUT-DIRECTORY, you can use --max-page-size-child BYTES to change the size of the separate pages of an HTML report from the default of 204,800. Still another alternative is to change the default 1,024 lines for a unified-diff block (that is, for separate chunks of the report) with --max-diff-block-lines LINES. These options are primarily for comparisons of long files, such as .iso images, and are generally irrelevant when dealing with files in MS Word or LibreOffice format unless you are comparing complete manuscripts.
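The byte values quoted above are easier to reason about in KiB. The following sketch, with hypothetical helper names kib and limit_options, checks the arithmetic behind the stated defaults and builds a pair of limit flags; the chosen limits of 100 KiB and 50 KiB are arbitrary examples, not recommendations.

```python
KIB = 1024


def kib(n):
    """Convert a size in KiB to bytes, the unit the limit options expect."""
    return n * KIB


# Defaults as quoted in the text.
DEFAULT_PAGE_SIZE = 409_600        # 400 KiB
DEFAULT_CHILD_PAGE_SIZE = 204_800  # 200 KiB


def limit_options(text_report_bytes=None, child_page_bytes=None):
    """Build limit flags; None means the option is left at its default."""
    opts = []
    if text_report_bytes is not None:
        opts += ["--max-text-report-size", str(text_report_bytes)]
    if child_page_bytes is not None:
        opts += ["--max-page-size-child", str(child_page_bytes)]
    return opts


print(DEFAULT_PAGE_SIZE // KIB, "KiB")
print(DEFAULT_CHILD_PAGE_SIZE // KIB, "KiB")
print(limit_options(text_report_bytes=kib(100), child_page_bytes=kib(50)))
```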
Difference Calculation Options
A number of options modify how diffoscope makes its comparisons. --exclude GLOB_PATTERN skips files whose names match a glob pattern, while --exclude-command REGEX_PATTERN skips external commands that match a regular expression; both can be used with either files or directories. When working with directories, you can set whether permissions and other file attributes are compared with --exclude-directory-metadata SETTING, which can be completed with auto, yes, no, or recursive. In addition, you can adjust fuzzy matching with --fuzzy-threshold, controlling how minor differences are handled. A setting of 0 means that all matches must be exact; however, the meaning of the default of 60 or the maximum of 400 has to be discovered through trial and error, since it is currently undocumented.
Other options are reminiscent of diff itself, setting the number of lines to compare. Use --max-diff-input-lines LINES to cap the number of lines that are compared (the default is 4,194,304). You can also set the maximum number of lines saved per diff block with --max-diff-block-lines-saved LINES.
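The exclusion options in this section take two different pattern syntaxes, a glob and a regular expression. The sketch below uses Python's standard fnmatch and re modules to show how each kind of pattern selects items, then builds the matching flags. It assumes fnmatch-style glob semantics, which may not match diffoscope's behavior in every detail; the paths, the command strings, and the helper comparison_options are hypothetical.

```python
import fnmatch
import re

# Hypothetical paths and command lines, for illustration only.
paths = ["build/app.o", "docs/readme.txt", "build/cache/index.bin"]
commands = ["readelf --debug-dump=info a.out", "objdump --disassemble a.out"]

glob_pattern = "build/*"                          # shell-style wildcard
regex_pattern = r"^readelf.*\s--debug-dump=info"  # regular expression

# In fnmatch, "*" also matches "/" characters, so nested paths match too.
excluded_paths = [p for p in paths if fnmatch.fnmatchcase(p, glob_pattern)]
excluded_commands = [c for c in commands if re.search(regex_pattern, c)]


def comparison_options(exclude=(), exclude_command=(),
                       directory_metadata=None, max_input_lines=None):
    """Build comparison flags from the options named in this section."""
    opts = []
    for pattern in exclude:
        opts += ["--exclude", pattern]
    for pattern in exclude_command:
        opts += ["--exclude-command", pattern]
    if directory_metadata is not None:
        if directory_metadata not in ("auto", "yes", "no", "recursive"):
            raise ValueError("invalid --exclude-directory-metadata setting")
        opts += ["--exclude-directory-metadata", directory_metadata]
    if max_input_lines is not None:
        opts += ["--max-diff-input-lines", str(max_input_lines)]
    return opts


print(excluded_paths)
print(excluded_commands)
print(comparison_options(exclude=[glob_pattern], directory_metadata="yes"))
```

Trying a pattern against a list of sample names first is a cheap way to catch a glob that is broader than intended before running a long comparison.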