Transform web pages into EPUB files
Read at Will

© Photo by Gülfer ERGIN
Instead of relying on a third-party read-it-later service, you can use this DIY tool to save articles from the Internet in a format that meets your specific needs.
Few of us have time to read long-form web articles during the day, which is why services that let you save interesting reads for later can come in handy. Popular services such as Pocket and Instapaper even offer apps you can use to read the saved content offline on your preferred device. Better still, the saved articles are reformatted for better readability and scrubbed of all ads, scripts, trackers, and other junk.
Hosted services are like restaurants, though. No matter how great the food and the service, you eventually start longing for home-cooked meals, not only because cooking at home is cheaper and more convenient, but because you can make any dish you wish just the way you like it and have fun in the process. In a similar vein, why settle for a ready-made, read-it-later service, when you can cook up your very own solution with a bit of creative thinking, the right mix of open source tools, and a dash of shell scripting magic? That's exactly what is on today's menu: a DIY read-it-later tool.
Instead of saving and serving slimmed down versions of web pages, this DIY read-it-later application is going to process pages and transform them into ePub files. This way, you can read the saved content on practically any device, and you can choose whatever ebook reading app you like. Because the DIY read-it-later tool is a simple shell script that relies on Linux tools, you don't need a server to host it. If necessary, you can run the tool on a remote Linux machine and serve ePub files via a dedicated Open Publication Distribution System (OPDS) server or simply publish the files on the web. In short, the DIY read-it-later tool gives you plenty of room for experimenting and setting up the solution that works best for your specific needs. Moreover, the fact that an ePub file is essentially a ZIP archive containing an XHTML file along with stylesheets, fonts, and so on makes the saved content future-proof and editable.
Preparatory Work
You don't have to code the DIY read-it-later tool from scratch, because I've already done the hard work for you and published the fruits of my labor, readiculous.sh, on GitHub [1]. All you need to do is download the source code as a ZIP archive and unpack it, or clone the project's Git repository using the command:
git clone https://github.com/dmpop/readiculous.git
Before getting down to the nitty-gritty, you need to do some preparatory work. The first order of business is to install the required software. The main readiculous.sh shell script relies on Pandoc, ImageMagick, jq, wget, and Go-Readability [2]. With the exception of Go-Readability, all of these dependencies are available in the official software repositories of most mainstream Linux distributions, so you can install them using the default package manager. To do this on Debian or an Ubuntu-based distribution, run the command:
sudo apt install pandoc imagemagick jq wget
The source code on GitHub [1] includes a binary version of the Go-Readability tool compiled for the x86_64 architecture. If you plan to use the script on any other platform, or you want to have the very latest version of the tool, you will have to compile it yourself. Fortunately, it's a rather straightforward thing to do. Install the Go language package (use the sudo apt install golang command on Debian and Ubuntu), and then run the following command to compile the command-line version of Go-Readability:
go get -u -v github.com/go-shiori/go-readability/cmd/...
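Once the compiling process is finished, you'll find the resulting binary in the ~/go/bin directory. Move the binary file into the readiculous directory, and you're done. Note that Go 1.18 and later no longer build binaries with go get; on those versions, go install with the package path followed by @latest should do the same job.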
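Before running the script for the first time, it can help to confirm that the packaged dependencies are actually installed. The following is a minimal sketch and not part of readiculous.sh; the missing_tools function name is hypothetical, and convert stands for the ImageMagick command mentioned later in this article. The Go-Readability binary lives in the readiculous directory rather than on the PATH, so it is not checked here.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print every command from the argument list
# that cannot be found on the PATH, one name per line.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Tools the article lists as installable from the distribution repositories.
missing=$(missing_tools pandoc convert jq wget)

if [ -n "$missing" ]; then
  printf 'Missing tools:\n%s\n' "$missing"
else
  echo "All required tools found"
fi
```

If the check reports anything as missing, install the corresponding package with your distribution's package manager and run the check again.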
How It Works
The readiculous.sh script (Listing 1) starts working by fetching the desired page, scrubbing it clean, and reformatting it for better readability. To do all that, the script uses the nifty Go-Readability tool. Go-Readability also extracts the page title and passes it to ImageMagick, which creates a cover image with the obtained title. Finally, the Pandoc tool transforms the saved page into an ePub file complete with the generated cover.
Listing 1
readiculous.sh
The script accepts three parameters: -u, -d, and -m. The mandatory -u parameter specifies the URL of the target page, while the optional -d parameter determines in which subdirectory the resulting ePub file should be saved. If the -d parameter is omitted, the script saves ePub files in the default Library directory. By specifying the subfolder, you can automatically sort the created ePub files by topic (for example, Language, Travel, Long Reads, and so on), or by any other criterion. The -m parameter allows you to convert several saved URLs at once, but I'll take a closer look at it later. The script uses a combination of the getopts tool, the do...done loop, and the case in control structure to read the values passed by the specified parameters and assign these values to variables (lines 34-50 in Listing 1). If the default Library directory doesn't exist, the script creates it (lines 52-57).
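To illustrate how getopts, a while loop, and a case statement work together, here is a minimal sketch of this kind of option parsing. It is not the code from Listing 1: the parse_args function and the url, dir, and mode variable names are assumptions chosen for the example.

```shell
#!/usr/bin/env bash

# Sketch of option parsing with getopts (names are assumptions,
# not taken from the actual readiculous.sh script).
parse_args() {
  local opt OPTIND=1          # reset OPTIND so the function can be called repeatedly
  url="" dir="" mode=""
  while getopts "u:d:m:" opt; do
    case "$opt" in
      u) url=$OPTARG ;;       # URL of the target page
      d) dir=$OPTARG ;;       # subdirectory for the resulting ePub file
      m) mode=$OPTARG ;;      # for example, "auto"
      *) echo "Usage: readiculous.sh -u URL [-d DIR] [-m auto]" >&2
         return 1 ;;
    esac
  done
}

parse_args -u "https://example.com/article" -d "Travel"
echo "url=$url dir=$dir mode=${mode:-none}"
```

The colon after each letter in "u:d:m:" tells getopts that the option takes a value, which it then makes available in OPTARG.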
Listing 1's readicule() function does the actual work. First, Go-Readability obtains the metadata of the specified page. The metadata is returned in the JSON format, and the jq tool extracts the title, while the tr tool strips double quotes (line 61). The same Go-Readability tool fetches the page using the specified URL and saves the processed version as an HTML file (line 63).
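The jq and tr step can be sketched as follows. This is an illustration rather than the line from Listing 1: the title_from_metadata function name is hypothetical, and the sample JSON is an assumption that contains only the title field mentioned above, not the full metadata that Go-Readability returns.

```shell
#!/usr/bin/env bash

# Hypothetical helper: read page metadata as JSON on standard input
# and print the title without the surrounding double quotes.
title_from_metadata() {
  jq '.title' | tr -d '"'
}

# Sample metadata (an assumption for this example).
printf '%s' '{"title": "Long Read"}' | title_from_metadata
```

Note that tr removes every double quote, including any that appear inside the title itself. If you adapt this step, jq's -r option prints the raw string without quotes, which makes the tr call unnecessary.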
The next step is to create a cover for use with the ePub file. Strictly speaking, covers are not necessary, but they do make it easier to find the file you need in the library, and they make the ePub file look less bland. To generate a cover, the script uses the wget tool for fetching a random 1024x800 image from the Lorem Picsum service and saves the file as cover.jpg (line 65). Then, the convert tool superimposes the obtained title onto the cover image (line 66).
There are, of course, plenty of other ways to create covers if you don't want the script to rely on a third-party service. For example, you can create covers with random background colors. To do this, you need to tweak the script so that it generates three random numbers between 0 and 255. The convert tool can then use the numbers as red, green, and blue values for generating a cover:
r=$(shuf -i 0-255 -n 1)
g=$(shuf -i 0-255 -n 1)
b=$(shuf -i 0-255 -n 1)
convert -size 800x1024 xc:rgb\($r,$g,$b\) cover.jpg
If solid colors are not your cup of tea, you can use the convert tool to generate a random colorful fractal image and specify the -paint and -blur options for a more artistic effect:
convert -size 800x1024 plasma:fractal -paint 10 -blur 10x20 cover.png
Finally, Pandoc finishes the task. It assembles the saved HTML file, the generated cover, and the obtained data into an ePub file and saves it either in the default directory (line 71) or in the subdirectory specified by the -d parameter.
But that's not all. If you read a lot, running the script every time you want to save a page for later can quickly become a nuisance. That's why the script also features the -m parameter. When specified with the auto value, the script picks URLs from the links.txt file one by one and generates ePub files for each one. The if...then...fi block that starts on line 79 checks whether the $mode value is set to auto. If so, the while...do loop (lines 86-90) reads URLs from the links.txt file and calls the readicule() function to generate ePub files. If the $mode value is not specified, the script simply calls the function to generate an ePub file using the URL passed by the -u parameter.
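The batch logic can be sketched like this. The example is a simplified illustration, not the code from Listing 1: readicule() is replaced by a stand-in that only prints the URL, the process_links helper is hypothetical, and a temporary file plays the role of links.txt.

```shell
#!/usr/bin/env bash

# Stand-in for the real readicule() function, which would build an ePub file.
readicule() {
  echo "Converting: $1"
}

# Hypothetical helper: call readicule() for every non-empty line in a file.
process_links() {
  local file=$1 line
  # The extra test handles a last line that lacks a trailing newline.
  while IFS= read -r line || [ -n "$line" ]; do
    [ -z "$line" ] && continue
    readicule "$line"
  done < "$file"
}

mode="auto"

# Temporary file standing in for links.txt.
links=$(mktemp)
printf '%s\n' "https://example.com/one" "https://example.com/two" > "$links"

if [ "$mode" = "auto" ]; then
  process_links "$links"
fi

rm -f "$links"
```

Reading the file line by line with IFS= and read -r keeps each URL intact, even if it contains backslashes or leading spaces.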
To speed up the process of transforming articles into ePub files, you can create a simple helper script:
#!/usr/bin/env bash
url=$(xclip -o)
echo "$url"
cd /path/to/readiculous || exit 1
./readiculous.sh -u "$url"
notify-send "Added to Readiculous"
Replace /path/to/readiculous with the actual path to the readiculous directory, and save the script under an appropriate name (for example, add-to-readiculous.sh). Install the xclip tool on your system, and assign a keyboard shortcut to the script.
The Matter of Reading
Saving articles in the ePub format means that you can read them on practically any device and any platform. Better yet, if you use Apple Books or Google Books, you can take advantage of the features these apps offer, including synchronization across multiple devices, saving highlights, library management functionality, and more.
However, if you've gone to the trouble of rolling out your own read-it-later tool, it probably doesn't make much sense to use a third-party commercial platform for reading. Enter KOReader [3], an open source ebook reader application available for Linux, Android, and a slew of dedicated readers. Despite its deceptively simple interface, KOReader packs an impressive array of features, including syncing, highlights, gesture support, note-taking capabilities, extensions, and much, much more (Figure 1). So if you want to keep your entire read-it-later toolchain open source, you should use KOReader.