diff and merge

Repurposing Old Tools

© Lead Image © Ion Chiosea, 123RF.com

Article from Issue 204/2017
Author(s):

Diff and merge: They're not just for developers.

Recently, a friend of mine returned to a manuscript after several months. The manuscript had half a dozen versions, and she could no longer remember how each one differed. Listening to her problem, I had a blinding flash of the obvious: diff [1], and related commands like diff3 [2] and merge [3], can be as much help to her as they have been to coders over the decades.

diff is a utility that compares two files line by line. For coders, diff is one of the commands that define Unix-like operating systems such as Linux. Although file comparison utilities are as old as Unix, diff itself was first released in 1974 for text files, with support for binary files added later. diff presents users with a summary of the comparison in two different formats, and it can also combine both files into a single merged output. diff3 [2], a similar utility, operates in a like manner on three files, although it does not support binary formats. More sophisticated tools like patch have been developed, but diff is still installed by default in many distributions, and "diff" remains a standard name for any patch, just as the grep command has given its name to any file search.

Basic Comparisons

Typing info diff (the man page is incomplete) quickly shows how diff can be as useful to a writer as a programmer. The command follows the standard format of a command followed by options and two files. The first file is the original, or any file if, as in my friend's case, the original is unknown or irrelevant:

diff OPTIONS ORIGINAL-FILE OTHER-FILE

Just by adding the --brief (-q) option, a writer can tell if the files are different – something that file attributes alone cannot always show. Similarly, --report-identical-files (-s) either reports when the files are the same or displays the differences (Figure 1). In some situations, like my friend's, this information alone may be enough to let some files be ignored.

Figure 1: The quickest way to use diff is to check whether two files differ or are the same.
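
As a minimal sketch of these two checks, assuming GNU diff and a scratch directory, the following commands compare three throwaway files. The file names (draft1.txt and so on) are invented for illustration only.

```shell
# Work in a scratch directory so that no real manuscript is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'Chapter One\nIt was a dark night.\n' > draft1.txt
printf 'Chapter One\nIt was a stormy night.\n' > draft2.txt
cp draft1.txt draft1-copy.txt

# --brief (-q) only reports whether the files differ (exit status 1 if so).
brief_status=0
brief_out=$(diff --brief draft1.txt draft2.txt) || brief_status=$?
printf '%s\n' "$brief_out"

# --report-identical-files (-s) speaks up when the files match (exit status 0).
same_status=0
same_out=$(diff -s draft1.txt draft1-copy.txt) || same_status=$?
printf '%s\n' "$same_out"
```

The exit status is often all a script needs: 0 means the files are the same, 1 means they differ, and 2 means diff ran into trouble.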

Even more efficiently, directories can be specified instead of files, with --recursive (-r) added to include subdirectories in order to locate identical files. In the same way, the --from-file=DIRECTORY1 and --to-file=DIRECTORY2 options can be used to compare files of the same name in different directories. With --exclude=PATTERN (-x PATTERN), files that match the pattern are excluded, while --exclude-from=FILE (-X FILE) excludes files that match the patterns that are listed, one per line, in the designated file. Still other options when comparing directories are --starting-file=FILE (-S FILE), which begins a directory comparison at the named file, --ignore-file-name-case, and --no-ignore-file-name-case. All these options make for a more targeted search, and, although they take a while to set up, are still much faster than opening all the files for comparison.
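
A directory comparison might look like the sketch below. It assumes GNU diff, and the directory layout (v1, v2, and a stray .bak file) is hypothetical.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir -p v1/notes v2/notes

printf 'Opening scene\n' > v1/chapter1.txt
printf 'Opening scene, revised\n' > v2/chapter1.txt
printf 'Outline\n' > v1/notes/outline.txt
printf 'Outline\n' > v2/notes/outline.txt
printf 'old backup\n' > v1/chapter1.txt.bak   # exists only in v1

# -r descends into subdirectories, -q names files that differ,
# -s names files that are identical, and -x skips files matching the pattern.
dir_report=$(diff -r -q -s -x '*.bak' v1 v2) || true
printf '%s\n' "$dir_report"
```

Without the -x option, the report would also contain an "Only in v1" line for the backup file.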

However, the comparison can be far more specific. Some options, such as --show-c-function (-p) are specific to programming, but others apply to regular text as easily as code. You can, for example, use --ignore-all-space (-w) so that differences in white space are not considered. Similarly, when comparing plain text files, using --ignore-blank-lines (-B) ignores the blank lines that are being used to separate paragraphs. A particularly useful option is --ignore-matching-lines=REGULAR-EXPRESSION, which can help to focus results.
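
The following sketch, again assuming GNU diff, shows the three options working together on two hypothetical drafts that differ only in spacing, an extra blank line, and a "Saved:" stamp invented for the example.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'First paragraph.\n\nSecond paragraph.\nSaved: Monday\n' > old.txt
printf 'First   paragraph.\n\n\nSecond paragraph.\nSaved: Friday\n' > new.txt

# A plain comparison reports differences (exit status 1).
plain_status=0
diff old.txt new.txt > /dev/null || plain_status=$?

# -w ignores white space, -B ignores changes that only add or remove blank lines,
# and -I ignores changes whose lines all match the regular expression.
ignore_status=0
diff -w -B -I '^Saved:' old.txt new.txt || ignore_status=$?

echo "plain: $plain_status, filtered: $ignore_status"
```

Because every remaining difference is one that the options ignore, the filtered comparison prints nothing and treats the drafts as the same.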

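More specifically, experts can specify what to display with --GTYPE-group-format=FORMAT, where GTYPE is old, new, changed, or unchanged. Within the format, %< stands for the lines from the original file, %> for the lines from the second file, and %= for the lines common to both. A group format can also print the first line number (F), the last line number (L), or the number of lines (N) in a group by using a printf-style specification such as %dF, with uppercase letters referring to the second file and lowercase letters to the original. Similarly, --LTYPE-line-format=FORMAT, where LTYPE is old, new, or unchanged, formats individual lines, with %L standing for the contents of a line and %dn for its line number. Both options have a number of other completions, so consult the man or info page for more details.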
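
To make the format options more concrete, here is a small sketch that assumes GNU diff. The wording of the labels ("changed", "added", "old", "new") is arbitrary and chosen only for this example.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'alpha\nbeta\ngamma\n' > old.txt
printf 'alpha\nBETA\ngamma\ndelta\n' > new.txt

# Group formats: hide unchanged text and label each altered group with
# its line numbers in the old file (%df, %dl) or the new file (%dF, %dL).
group_out=$(diff \
  --unchanged-group-format='' \
  --changed-group-format='changed, old lines %df-%dl:
%<becomes:
%>' \
  --new-group-format='added, new lines %dF-%dL:
%>' \
  old.txt new.txt) || true
printf '%s\n' "$group_out"

# Line formats: prefix every altered line with its origin and line number.
line_out=$(diff \
  --unchanged-line-format='' \
  --old-line-format='old %dn: %L' \
  --new-line-format='new %dn: %L' \
  old.txt new.txt) || true
printf '%s\n' "$line_out"
```

For a writer, a format like this can turn diff's terse notation into something closer to a change log.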

Output Formats

By default, diff displays the lines where differences occur in a set format (Figure 2). If the files are identical, there will be no output whatsoever. However, assuming some output is produced, each difference starts with a summary, such as 5,6c7. This summary gives the line number or range of lines affected in the original file on the left, and the corresponding line number or range in the other file on the right. In between is one of three letters: c (change), a (append), or d (delete). Below the summary, the lines from the original file are given first, each marked by a less-than sign (<). Below them, after a separator of three hyphens, the lines from the second file are each marked by a greater-than sign (>). To make a difference easier to find, you can surround it with unchanged context lines by adding the option --context (-c), which shows three context lines by default, or --context=NUMBER (-C NUMBER) to choose how many.

Figure 2: diff output consists of a comparison summary, plus context lines around where differences occur.
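
The two formats can be compared with a short sketch, assuming GNU diff and two invented six-line files.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'one\ntwo\nthree\nfour\nfive\n' > old.txt
printf 'one\ntwo\n3\nfour\nfive\nsix\n' > new.txt

# Default ("normal") format: a summary line, then < and > lines.
normal_out=$(diff old.txt new.txt) || true
printf '%s\n' "$normal_out"
# 3c3
# < three
# ---
# > 3
# 5a6
# > six

# Context format with one line of context around each difference.
context_out=$(diff -C 1 old.txt new.txt) || true
printf '%s\n' "$context_out"
```

In the context format, changed lines are flagged with an exclamation mark and added lines with a plus sign, rather than with the less-than and greater-than signs of the default format.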

An even easier output display can be had by adding --side-by-side (-y) to the command. This option displays the original file's contents on the left and the second file on the right, making detailed comparisons easy (Figure 3). You can adjust the total width of a side-by-side display from its default of 130 columns with --width=NUMBER (-W NUMBER). Another option is to set --left-column, so that lines common to both files are shown only in the left column.

Figure 3: One of the easiest ways to use diff is to display output in two columns.
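
A side-by-side comparison narrow enough for a small terminal could be sketched as follows, assuming GNU diff. The width of 40 columns is an arbitrary choice.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'one\ntwo\nthree\n' > old.txt
printf 'one\n2\nthree\nfour\n' > new.txt

# -y prints the files in two columns; -W sets the total width in columns.
# A | between the columns marks a changed line, and a > marks an added line.
side_out=$(diff -y -W 40 old.txt new.txt) || true
printf '%s\n' "$side_out"

# --left-column prints common lines on the left only, followed by a ( marker.
left_out=$(diff -y -W 40 --left-column old.txt new.txt) || true
printf '%s\n' "$left_out"
```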

Regardless of which of these two output formats you use, the display is noticeably more flexible than that offered by LibreOffice's Edit | Track Changes, which can require far more concentration to read. If you open a second copy of the original, you can merge the files manually as you compare diff's results. A manual comparison is laborious, but it may be the best way to compare results.

A third alternative is to use --ifdef=NAME (-D NAME) to create a merged output file (Figure 4). This output can be copied and pasted into a new file, where a writer can manually merge. However, if you are confident that the two files can be merged to get the results that you want, you can use --ed (-e) to produce a script for the ed editor that, when run against the original, turns it into the second file. In programming, --ed is used to generate a patch, yet it can serve a writer's purpose just as well.

Figure 4: diff can also create a merged output file.
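
Both outputs can be tried on a pair of small files, as in this sketch. It assumes GNU diff; the label REVISED is an arbitrary name, and the final step runs only if the ed editor happens to be installed.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'one\ntwo\nthree\n' > old.txt
printf 'one\n2\nthree\n' > new.txt

# -D NAME interleaves both versions, wrapping each difference in
# #ifndef NAME / #else / #endif markers.
ifdef_out=$(diff -D REVISED old.txt new.txt) || true
printf '%s\n' "$ifdef_out"

# -e emits an ed script that turns old.txt into new.txt.
ed_script=$(diff -e old.txt new.txt) || true
printf '%s\n' "$ed_script"
# 2c
# 2
# .

# Apply the script to a copy, never to the only copy of the original.
if command -v ed > /dev/null 2>&1; then
  cp old.txt rebuilt.txt
  { printf '%s\n' "$ed_script"; echo w; } | ed -s rebuilt.txt
fi
```

The appended w command tells ed to write the changed buffer back to rebuilt.txt, which should then match new.txt.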

In all formats, you can further customize by adding --color (Figure 5). Left unspecified, the --color option will use color when standard output is to a terminal. However, you can also complete the option with =never to never use color or =always to use it in every case. By default, red is used for deleted lines, green for added lines, cyan for line numbers, and a bold font weight for the header. Colors can be customized with --palette=PALETTE, as specified in the diff info file.

Figure 5: Adding color can make diff output easier to read.
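
As a rough sketch, assuming a version of GNU diff recent enough to support --color, the commands below force color on and then swap in a custom palette. The palette values (bold blue for additions, bold magenta for deletions) are just an example.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'one\ntwo\nthree\n' > old.txt
printf 'one\n2\nthree\n' > new.txt

# =always keeps the color codes even when the output is captured or piped.
color_out=$(diff --color=always old.txt new.txt) || true
printf '%s\n' "$color_out"

# ad= sets the color for added lines and de= for deleted lines,
# using the same numeric codes that terminals use.
palette_out=$(diff --color=always --palette='ad=1;34:de=1;35' old.txt new.txt) || true
printf '%s\n' "$palette_out"
```

Piping colored output into a pager generally requires one that passes color codes through, such as less -R.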

diff3 and merge

diff's obvious limitation is that the original file must be compared against each of the other files one at a time. A quicker method is to use diff3 or merge to compare two files simultaneously with the original.

Unlike diff, diff3 expects the original as the second of its three files, between the two revised versions. Its default output also differs in detail: diff3 introduces each set of differences with a line of four equals signs (====) and labels the lines that follow with the number of the file (1:, 2:, or 3:) to which they belong. In addition, diff3 can add --show-all (-A) to output all changes, with conflicts listed in brackets. diff3's output can also be limited to overlapping changes with --overlap-only (-x) or to non-overlapping changes with --easy-only (-3), while --show-overlap (-E) includes all changes and brackets the conflicts. Other options for output include --ed (-e), which diff3 shares with diff, and --merge (-m), diff3's more sensibly named version of diff's --ifdef=NAME (-D NAME).
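
A three-way merge might be sketched as follows, assuming GNU diff3. The file names are hypothetical, with original.txt standing in for the common ancestor, which diff3 expects as the second file.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'intro\nbody one\nbody two\nbody three\nending\n' > original.txt
printf 'INTRO\nbody one\nbody two\nbody three\nending\n' > mine.txt    # edits line 1
printf 'intro\nbody one\nbody two\nbody three\nENDING\n' > yours.txt   # edits line 5
printf 'Intro!\nbody one\nbody two\nbody three\nending\n' > rival.txt  # also edits line 1

# The two sets of edits do not overlap, so -m prints a clean merge.
clean_status=0
clean_merge=$(diff3 -m mine.txt original.txt yours.txt) || clean_status=$?
printf '%s\n' "$clean_merge"

# Both files edit line 1 differently, so the conflict is bracketed
# with <<<<<<<, |||||||, =======, and >>>>>>> markers (exit status 1).
conflict_status=0
conflict_merge=$(diff3 -m mine.txt original.txt rival.txt) || conflict_status=$?
printf '%s\n' "$conflict_merge"
```

Because diff3 writes the merge to standard output, none of the three input files is altered.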

merge is a near-duplicate of diff3. However, instead of providing output that can be copied and pasted into a new file, merge writes the combined result back into the first file named on the command line. This behavior is not a problem if there are no conflicts. However, if conflicts do exist, merge warns of them, and the file will need editing. This extra effort is not much trouble in a plain text file, but in a binary format like Open Document Format, it could potentially corrupt the file. The same is true for -A, which, as in diff3, offers more verbose output. For this reason, only use merge after making a backup of the file that it will overwrite.
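
One cautious routine is sketched below. It assumes the merge command from the RCS package, which is not installed everywhere, so the sketch checks for it first. The file names are hypothetical, and the -p option sends the result to standard output instead of overwriting the first file.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'intro\nbody one\nbody two\nbody three\nending\n' > original.txt
printf 'INTRO\nbody one\nbody two\nbody three\nending\n' > mine.txt
printf 'intro\nbody one\nbody two\nbody three\nENDING\n' > yours.txt

if command -v merge > /dev/null 2>&1; then
  # Keep a backup, because merge overwrites its first argument.
  cp mine.txt mine.txt.bak

  # Preview the result first; -p leaves mine.txt untouched.
  merge -p mine.txt original.txt yours.txt > preview.txt

  # Fold the changes from original.txt to yours.txt into mine.txt.
  merge mine.txt original.txt yours.txt
  cat mine.txt
else
  echo "merge is not installed; diff3 -m offers a similar result"
fi
```

If the preview shows conflict markers, it may be easier to resolve them in preview.txt and leave the working file alone.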
