A modern diff utility

Command Line – diffoscope

Lead Image © Vlad Kochelaevskiy, 123RF.com

Article from Issue 240/2020
Author(s):

With support for more than 60 file formats, diffoscope extends the power of diff beyond the plain text or HTML file.

The first command in Unix-like systems for comparing files and directories was diff. Originally written by Douglas McIlroy and first appearing in Unix 5th Edition in 1974, diff rapidly became an essential programming tool. Today, the original command is still available, and most programming languages have their own versions of diff. However, diff and its derivatives generally have one limitation: With few exceptions, they work only with plain text or markup languages like HTML. A newer variation called diffoscope [1] brings a new level of functionality to file comparison.

Diffoscope is developed primarily by Debian's Reproducible Builds project [2], which aims to increase the robustness and security of Debian packages by ensuring that they always build the same way. Given Debian's nearly 60,000 packages and the variety of hardware available, this is no small task, especially considering that small errors in code can be hard to trace. Diffoscope was written to make this task easier by quickly tracking down differences between two files that are supposed to be identical but perform differently. As a side effect, diffoscope provides a modern diff utility that works across most programming languages and brings the power of diff to desktop users and non-programmers, especially writers who wish to compare drafts. Already, diffoscope supports over 60 formats that range from archives and filesystems to audio and text files, including MS Word, LibreOffice Writer, and PDF (Table 1), and more seem likely to follow.

Table 1: Supported Formats

Android APK files                       LLVM IR bitcode files
Android boot images                     LZ4 compressed files
ar(1) archives                          macOS binaries
Berkeley DB database files              Microsoft Windows icon files
bzip2 archives                          Microsoft Word .docx files
Character/block devices                 Mono Portable Executable files
ColorSync color profiles (.icc)         Multimedia metadata
coreboot CBFS filesystem images         OCaml interface files
cpio archives                           Ogg Vorbis audio files
Dalvik .dex files                       OpenOffice/LibreOffice .odt files
Directories                             OpenSSH public keys
Debian buildinfo files                  OpenWRT package archives (.ipk)
Debian .changes files                   PDF documents
Debian source packages (.dsc)           PGP signatures
Device Tree Compiler blob files         PGP signed/encrypted messages
ELF binaries                            PNG images
ext2/ext3/ext4/Btrfs/FAT filesystems    PostScript documents
freedesktop.org fontconfig cache files  RPM archives
Free Pascal files (.ppu)                Rust object files (.deflate)
gettext message catalogs                SQLite databases
GHC Haskell .hi files                   SquashFS filesystems
GIF image files                         Statically linked binaries
Git repositories                        Symlinks
GNU R database files (.rdb)             Tape archives (.tar)
GNU R Rscript files (.rds)              tcpdump capture files (.pcap)
Gnumeric spreadsheets                   Text files
Gzipped files                           TrueType font files
ISO 9660 CD images                      WebAssembly binary module
Java .class files                       XML binary schemas (.xsb)
JavaScript files                        XML files
JPEG images                             XZ compressed files
JSON files

Diffoscope's basic command structure is:

diffoscope FILE1 FILE2

The second path is optional. According to diffoscope's help text, if only one path is given, no comparison is made; instead, diffoscope reads a previously saved diffoscope diff from that path and renders it in whatever output formats the rest of the command line requests. For convenience, the output of a comparison can be piped through less or more. You might also add the --progress option for large files like DVD images. If you are dealing with large files, you might also run up against the built-in limits for output. Rather than resetting them one by one, you can cancel all of them with the option --no-default-limits.
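For anyone who wants to call diffoscope from a script, the basic invocation and the two options just mentioned can be sketched in Python. This is a minimal sketch, not part of diffoscope itself: the function names build_command and compare are hypothetical, and the exit-status convention noted in the comment (0 for identical, 1 for differences, anything else for an error) is an assumption borrowed from diff.

```python
import shutil
import subprocess


def build_command(first, second, progress=False, no_limits=False):
    """Return the argument list for one diffoscope comparison (hypothetical helper)."""
    cmd = ["diffoscope"]
    if progress:
        cmd.append("--progress")           # progress display for large inputs
    if no_limits:
        cmd.append("--no-default-limits")  # cancel all built-in output limits
    cmd += [first, second]
    return cmd


def compare(first, second, **options):
    """Run diffoscope; return its exit status, or None if it is not installed."""
    if shutil.which("diffoscope") is None:
        return None
    # Assumption: as with diff, 0 = identical, 1 = differences, other = error.
    return subprocess.run(build_command(first, second, **options)).returncode


print(build_command("disc1.iso", "disc2.iso", progress=True, no_limits=True))
```

Calling compare("disc1.iso", "disc2.iso", progress=True) would then run the comparison and hand back the exit status, so a script can branch on whether differences were found.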

By default, diffoscope writes to standard output, but you can also save the results to a file. The output shows the content of the first file in red text, with each line prefaced by a minus sign, and the content of the second file in white text prefaced by a plus sign. At the top of the output, you'll find statistics that vary with the file type. For example, in Figure 1, the files share LibreOffice's .odt format, and the statistics are the file names, the amount of text in each file that differs, and the total number of words in each file. By contrast, in Figure 2, a directory diff is prefaced by file listings, file permissions, and other attributes. The output is driven by context, ensuring that it is useful for more than the diff itself.
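The minus and plus prefixes follow the unified-diff convention, which can be illustrated without diffoscope at all. The sketch below uses Python's standard difflib module on two made-up drafts; it is not diffoscope's own code, and it shows none of the format-specific statistics described above.

```python
import difflib

# Two made-up drafts, one line changed between them.
draft_1 = ["The quick brown fox\n", "jumps over the lazy dog.\n"]
draft_2 = ["The quick brown fox\n", "leaps over the lazy dog.\n"]

diff = list(difflib.unified_diff(draft_1, draft_2,
                                 fromfile="draft1.txt", tofile="draft2.txt"))

# Lines from the first file start with "-", lines from the second with "+";
# the "---" and "+++" lines are the header naming the two files.
removed = [line for line in diff if line.startswith("-") and not line.startswith("---")]
added = [line for line in diff if line.startswith("+") and not line.startswith("+++")]

print("".join(diff), end="")
```

Unchanged lines are printed with a leading space, which is what supplies the context around each difference.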

Figure 1: File comparison includes stats useful for the format.
Figure 2: Diffoscope can also compare directory contents and structure.

Output Formatting Options

Besides standard output, diffoscope can save its results in several file formats. To write output to a text file, add the option --text OUTPUT-FILE, giving the full path. You can also color-code text output with --text-color WHEN, replacing WHEN with never, auto, or always. Color is enabled automatically in standard output, but disabled by default when you write to a file. Similarly, an HTML file is named with --html OUTPUT-FILE. Color is not supported for HTML files, but you can split the report across multiple HTML pages with --html-dir OUTPUT-DIRECTORY, so you can absorb the output in small chunks, and use --css URL to format the output as desired. HTML output can also link to a copy of jQuery with --jquery URL. Other supported file format options are --json OUTPUT-FILE, --markdown OUTPUT-FILE, and --restructured-text OUTPUT-FILE, all three of which can write either to a file or to standard output. With any of these formats, --output-empty writes a report even when no differences are found.
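Because several output options can appear on one command line, a script might assemble them from keyword arguments. The following is a rough sketch: output_args is a hypothetical helper, the directory form is assumed to be spelled --html-dir as in diffoscope's help output, and "-" is assumed to stand for standard output.

```python
def output_args(text=None, html=None, html_dir=None, json=None,
                markdown=None, rst=None, color=None, report_empty=False):
    """Build diffoscope output options from keyword arguments (hypothetical helper)."""
    args = []
    targets = (("--text", text), ("--html", html), ("--html-dir", html_dir),
               ("--json", json), ("--markdown", markdown),
               ("--restructured-text", rst))
    for flag, target in targets:
        if target is not None:
            args += [flag, target]
    if color is not None:
        # WHEN must be one of the three values named in the text.
        if color not in ("never", "auto", "always"):
            raise ValueError("color must be never, auto, or always")
        args += ["--text-color", color]
    if report_empty:
        args.append("--output-empty")  # write a report even with no differences
    return args


print(output_args(text="/tmp/report.txt", color="always"))
print(output_args(json="-", report_empty=True))
```

The resulting list can be placed between the command name and the two paths being compared.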

Output Limit Options

Coming from an era of memory limitations, diff is economical, by default writing just a few lines so that the context of a difference can be read. By contrast, diffoscope has limits that are so high that, for all practical purposes, it often has no limits. If you want to restrict diffoscope's output, perhaps to make it more manageable, you have to add limits deliberately. The number of bytes in a text report is unlimited by default, but you can use --max-text-report-size BYTES to define a limit. For HTML reports, you can use --max-page-size BYTES to change the default of 409,600 bytes for the main page, or, if using --html-dir OUTPUT-DIRECTORY, --max-page-size-child BYTES to change the size of the separate child pages from the default of 204,800. Still another alternative is --max-diff-block-lines LINES, which changes the default of 1,024 lines for a unified-diff block, that is, for each separate chunk of the report. These options are primarily for comparisons of long files, such as .iso images, and are generally irrelevant when dealing with files in MS Word or LibreOffice format unless you are comparing complete manuscripts.
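The byte figures are easier to reason about in kibibytes: 409,600 bytes is 400KiB and 204,800 bytes is 200KiB. The sketch below does that conversion and builds the matching options. The helper limit_args is hypothetical, and the option names --max-page-size and --max-diff-block-lines are taken from diffoscope's help output rather than from anything this article confirms, so check them against your installed version.

```python
KIB = 1024

# Defaults quoted in the text, in bytes.
DEFAULT_PAGE_BYTES = 409_600
DEFAULT_CHILD_PAGE_BYTES = 204_800


def limit_args(report_kib=None, page_kib=None, child_kib=None, block_lines=None):
    """Translate limits given in KiB and lines into options (hypothetical helper)."""
    args = []
    if report_kib is not None:
        args += ["--max-text-report-size", str(report_kib * KIB)]
    if page_kib is not None:
        args += ["--max-page-size", str(page_kib * KIB)]
    if child_kib is not None:
        args += ["--max-page-size-child", str(child_kib * KIB)]
    if block_lines is not None:
        args += ["--max-diff-block-lines", str(block_lines)]
    return args


print(DEFAULT_PAGE_BYTES // KIB, DEFAULT_CHILD_PAGE_BYTES // KIB)
print(limit_args(report_kib=64, block_lines=200))
```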

Difference Calculation Options

A number of options modify how diffoscope makes its comparisons. Two of them narrow the comparison: --exclude GLOB_PATTERN skips files whose names match a glob pattern, while --exclude-command REGEX_PATTERN skips external commands that match a regular expression. Both can be used with either files or directories. When working with directories, you can set whether permissions and other file attributes are ignored with --exclude-directory-metadata SETTING, which can be completed with auto, yes, no, or recursive. In addition, you can tune fuzzy matching with --fuzzy-threshold, controlling how minor differences are handled. A setting of 0 means that all matches must be exact; however, the meaning of the default of 60 or the maximum of 400 has to be discovered through trial and error, since it is currently undocumented.
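The two exclusion options take different pattern languages, which is easy to overlook. The sketch below contrasts glob and regular-expression matching using Python's standard fnmatch and re modules. The file names, command lines, and patterns are invented for illustration, and the way diffoscope applies the patterns internally is an assumption here.

```python
import fnmatch
import re

# Invented examples: paths inside a package and external commands
# that a comparison tool might run on them.
paths = ["pkg/usr/bin/tool", "pkg/usr/share/doc/changelog.gz", "pkg/cache/index.pyc"]
commands = ["readelf --wide --sections tool", "objdump --disassemble tool", "strings tool"]

glob_pattern = "*.pyc"         # shell-style wildcard, as for --exclude
regex_pattern = r"^readelf.*"  # regular expression, as for --exclude-command

kept_paths = [p for p in paths if not fnmatch.fnmatch(p, glob_pattern)]
kept_commands = [c for c in commands if not re.search(regex_pattern, c)]

print(kept_paths)
print(kept_commands)
```

A glob such as *.pyc would be a poor regular expression, and a regular expression such as ^readelf.* would match nothing useful as a glob, so each pattern belongs with its own option.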

Other options are reminiscent of diff itself, setting the number of lines to compare. Use --max-diff-input-lines LINES to cap the number of lines passed to the comparison (the default is 4,194,304). You can also set the maximum number of lines saved per diff block with --max-diff-block-lines-saved LINES.
