An XML, HTML, and JSON data extraction tool
Easy Extraction
Xidel lets you easily extract and process data from XML, HTML, and JSON documents.
There are numerous ways to scrape a web page for data. In fact, the right mix of Python modules and Python logic glue could probably do the trick, but sometimes you just want a convenient tool that lets you extract data from websites. Xidel [1], a multi-platform command-line tool, offers a one-stop alternative to quickly extract, process, and save data from XML, HTML, or JSON documents.
Under the Hood
Xidel wraps XQuery, XPath, and JSON into one convenient front end. XQuery, a W3C Recommendation since 2007, lets you query XML or HTML files as if they were database servers, process the extracted data as desired, and save data to other files. As shown in the XQuery tutorial [2], XQuery-capable software can complete requests like finding all the CDs in an online catalog that cost less than $10, sorted by release date.
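To make the CD example concrete, here is a minimal sketch of how such a query might look when run through Xidel. The file catalog.xml and its element names (cd, title, price, year) are invented for illustration and are not the tutorial's actual data; the --xquery option tells Xidel to treat the expression as XQuery.

```shell
#!/bin/bash
# Hypothetical sample data: element names and values are made up for this sketch.
workdir=$(mktemp -d)
cat > "$workdir/catalog.xml" <<'EOF'
<catalog>
  <cd><title>Alpha</title><price>8.50</price><year>1999</year></cd>
  <cd><title>Beta</title><price>12.00</price><year>1985</year></cd>
  <cd><title>Gamma</title><price>9.90</price><year>1990</year></cd>
</catalog>
EOF

# Titles of all CDs cheaper than 10, ordered by release year.
# Single quotes keep the shell from expanding $cd.
query='for $cd in //cd
       where $cd/price < 10
       order by number($cd/year)
       return $cd/title'

if command -v xidel >/dev/null 2>&1; then
  xidel "$workdir/catalog.xml" --xquery "$query"
else
  echo "xidel not found in PATH" >&2
fi
```

With this sample data, the query selects the Gamma (1990) and Alpha (1999) entries, in that order, and skips Beta because of its price.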
Xidel also fully supports XPath [3], another W3C Recommendation, and the data-interchange format JavaScript Object Notation (JSON) [4], which is standardized outside the W3C. XPath defines both a syntax for addressing the elements of an XML document and a library of standard functions that make it easy to navigate through those elements and extract them. JSON represents data as objects made of unordered sets of name/value pairs, along with ordered arrays of values (I'll show some examples later in this article).
Installation
You can download Xidel from the website [1] with just a few clicks. Xidel offers a choice between a binary package in DEB format and a ZIP archive that contains just five files: a digital certificate, the changelog, an exhaustive README file that explains in detail how Xidel works, the executable program, and its installer. The installer (Listing 1) should be run with administrator privileges if you install into a system directory. At 11 lines, the installer could hardly be simpler.
Listing 1
Installation Script
01 #!/bin/bash
02 PREFIX=$1
03 sourceprefix=
04 if [[ -d programs/internet/xidel/ ]]; then sourceprefix=programs/internet/xidel/; else sourceprefix=./; fi
05 mkdir -p $PREFIX/usr/bin
06
07 install -v $sourceprefix/xidel $PREFIX/usr/bin
08 if [[ -f $sourceprefix/meta/cacert.pem ]]; then
09   mkdir -p $PREFIX/usr/share/xidel
10   install -v $sourceprefix/meta/cacert.pem $PREFIX/usr/share/xidel/;
11 fi
Listing 1 uses the directory passed as its first argument as the installation $PREFIX (line 2). On my computer, I chose the root folder (/), but you may prefer /opt or a similar location. Next, the script uses the install program to copy the xidel executable and its certificate into the usr/bin and usr/share/xidel subdirectories of $PREFIX, respectively.
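One way to see what the installer does without touching system directories is to repeat its steps against a throwaway prefix. The following is only a sketch: the dummy xidel file and certificate are placeholders I create on the spot, not the real program, and both directories come from mktemp.

```shell
#!/bin/bash
# Stand-in source directory with a dummy "xidel" file and certificate
# (placeholders, not the real program).
src=$(mktemp -d)
PREFIX=$(mktemp -d)          # staging prefix instead of / or /opt
mkdir -p "$src/meta"
printf '#!/bin/sh\necho dummy xidel\n' > "$src/xidel"
echo "placeholder certificate" > "$src/meta/cacert.pem"

# The same steps Listing 1 performs, with the variables quoted.
mkdir -p "$PREFIX/usr/bin"
install -v "$src/xidel" "$PREFIX/usr/bin"
if [[ -f "$src/meta/cacert.pem" ]]; then
  mkdir -p "$PREFIX/usr/share/xidel"
  install -v "$src/meta/cacert.pem" "$PREFIX/usr/share/xidel/"
fi

# Show the resulting layout under the staging prefix.
find "$PREFIX" -type f
```

Because install copies files with executable permissions by default, the copy in usr/bin is ready to run; passing / as the prefix instead would place the same files in /usr/bin and /usr/share/xidel.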
When I tried to launch the program after running the installer, I discovered that Xidel needs the developer versions of libopenssl and libcrypto (I couldn't find this problem documented at the time of writing). However, both libraries are available as native packages in the standard repositories of most distributions (e.g., libssl-dev on Debian derivatives, and openssl-devel on Fedora-based systems), so installing them takes a matter of minutes.
Main Features
Xidel can interact with websites if it has the proper data and instructions. It can log into websites on your behalf to perform tasks like updating personal information, submitting forms, or downloading private messages. Among other things, Xidel can reach websites using proxies, manage cookies, and pause between connections to prevent overloading servers and subsequently being banned. However, I do not cover these specific Xidel features for one simple reason: Websites change all the time, so any specific examples would be completely obsolete by the time you read this article. If you want to know how Xidel can, for example, handle your Reddit notifications, I recommend first checking the latest examples on the Xidel website and then if necessary asking for support on the Xidel mailing list (which I did to write this article).
As far as automatic data processing is concerned, Xidel reads and parses standard input or plain text files in JSON, XML, and HTML formats. After processing their content according to your instructions, Xidel can output the result in the same formats, as well as plain text or, as I will show later, shell variables. In addition, you can define the output separator between multiple items and create custom headers and footers for your data reports.
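As a rough illustration of these output controls, the sketch below runs Xidel against a small HTML file that I made up for the purpose. It relies on the --output-separator, --output-header, --output-footer, and --output-format options; the page content and the variable name title are my own choices.

```shell
#!/bin/bash
# Hypothetical sample page, created only for this sketch.
workdir=$(mktemp -d)
page="$workdir/page.html"
cat > "$page" <<'EOF'
<html><head><title>Fruit list</title></head>
<body><ul><li>apple</li><li>pear</li><li>plum</li></ul></body></html>
EOF

if command -v xidel >/dev/null 2>&1; then
  # Join the list items with a custom separator, framed by header and footer text.
  xidel "$page" -e '//li' --output-separator=', ' \
        --output-header='Items: ' --output-footer=' (end)'

  # Emit shell assignments and load them into the current shell.
  # name:=expression binds a variable to the extracted value.
  eval "$(xidel "$page" -e 'title:=//title' --output-format=bash)"
  echo "Page title: $title"
else
  echo "xidel not found in PATH" >&2
fi
```

The eval step is what turns Xidel's output into ordinary shell variables, so only use it with documents you trust.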
Xidel's two main modes, extract and follow, are often used together. In a nutshell, the extract mode extracts and processes data from the current document; on its own, it is all you need to process the data inside one or more local files or web pages. The follow mode starts where extract leaves off: It follows the links found by previous operations in order to download and process the content they point to.
Xidel can run multiple extract and follow actions in the same call, as long as you write them in the right order and never ask to follow data that was not directly passed to Xidel or found by previous extract operations.
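The ordering rule can be sketched with a tiny site on the local disk. The three pages below are hypothetical, and the example assumes that Xidel resolves relative links between local files the same way it does for URLs; if it does not on your system, the same options apply to a real web address.

```shell
#!/bin/bash
# Hypothetical three-page site on the local disk, created only for this sketch.
site=$(mktemp -d)
cat > "$site/index.html" <<'EOF'
<html><head><title>Index</title></head>
<body><a href="one.html">One</a> <a href="two.html">Two</a></body></html>
EOF
echo '<html><head><title>Page one</title></head><body></body></html>' > "$site/one.html"
echo '<html><head><title>Page two</title></head><body></body></html>' > "$site/two.html"

if command -v xidel >/dev/null 2>&1; then
  # Order matters: -f first follows the links found in index.html,
  # then -e extracts the title from each page reached that way.
  xidel "$site/index.html" -f '//a' -e '//title'
else
  echo "xidel not found in PATH" >&2
fi
```

Here the follow expression comes first because the links are in the document passed on the command line, and the extract expression after it applies to the pages those links lead to.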
In extract mode, Xidel can recognize and select document elements by their CSS selectors. If you want to process the extracted data, Xidel uses XPath 3.0 expressions. For more complex tasks, you can use the full XQuery standard, which is Turing complete; in the words of a StackExchange answer, such a language can express "any algorithm you could think of, no matter how complex" [5].
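The difference between the selection styles is easier to see side by side. In this sketch, the forum page and its class names are invented; the commands use Xidel's --css option, a plain XPath expression with a standard function, and Xidel's css() extension function.

```shell
#!/bin/bash
# Hypothetical forum page, created only for this sketch.
workdir=$(mktemp -d)
page="$workdir/forum.html"
cat > "$page" <<'EOF'
<html><body>
<div class="topic"><a href="/t/1">First topic</a></div>
<div class="topic"><a href="/t/2">Second topic</a></div>
<div class="ad"><a href="/buy">Sponsored</a></div>
</body></html>
EOF

if command -v xidel >/dev/null 2>&1; then
  # Select the topic links with a CSS selector.
  xidel "$page" --css 'div.topic a'
  # The same selection in XPath, post-processed with a standard function.
  xidel "$page" -e '//div[@class="topic"]/a/upper-case(.)'
  # css() lets an XPath expression start from a CSS selector.
  xidel "$page" -e 'css("div.topic a")/@href'
else
  echo "xidel not found in PATH" >&2
fi
```

All three commands skip the sponsored link, because it sits in a div with a different class.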
However, when you need to extract several pieces of data at once, repeatedly, from specific sections of pages with a fixed structure (e.g., titles and links of the most viewed topics in a forum), I recommend pattern matching, which I will discuss later.
Syntax-wise, as you will see in the examples I provide later, Xidel extract commands are one-liners that first pass Xidel the file it should process and then, with the --extract= or -e option, a string that contains the actual operations to perform on that document. When that string becomes too long to edit comfortably on the command line, or you want to save it, you can write it to a file and pass the file to Xidel with the --extract-file option.
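A saved expression might be used as in the following sketch. The file names page.html and links.xpath, as well as the page content, are placeholders of my own; the expression itself is ordinary XPath.

```shell
#!/bin/bash
# Hypothetical input page and expression file, created only for this sketch.
workdir=$(mktemp -d)
cat > "$workdir/page.html" <<'EOF'
<html><body>
<a href="https://example.org/a">Site A</a>
<a href="https://example.org/b">Site B</a>
</body></html>
EOF

# Save the expression in a file: one "text -> target" line per link.
cat > "$workdir/links.xpath" <<'EOF'
//a/concat(., " -> ", @href)
EOF

if command -v xidel >/dev/null 2>&1; then
  # Equivalent to passing the file's content with -e on the command line.
  xidel "$workdir/page.html" --extract-file="$workdir/links.xpath"
else
  echo "xidel not found in PATH" >&2
fi
```

Keeping the expression in its own file also avoids shell quoting problems, because the quotes inside it no longer pass through the shell.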
The option for the follow mode is --follow= or -f. As with extract, this option gives Xidel the expression that describes which element or sequence of elements should be followed. There are many other options for the follow mode, but with one exception they are almost all mirror versions of the extract options (e.g., you can save your commands in a file and pass it to Xidel with --follow-file). The exception, --follow-level, specifies the maximum recursion level when following pages from other pages. Set this carefully, because its default value is 99,999!