An XML, HTML, and JSON data extraction tool

Easy Extraction

© Lead Image © Wutthichai Luemuang, 123RF.com

Xidel lets you easily extract and process data from XML, HTML, and JSON documents.

There are numerous ways to scrape a web page for data. The right mix of Python modules and some glue logic could probably do the trick, but sometimes you just want a convenient tool that lets you extract data from websites. Xidel [1], a multi-platform command-line tool, offers a one-stop alternative for quickly extracting, processing, and saving data from XML, HTML, or JSON documents.

Under the Hood

Xidel wraps XQuery, XPath, and JSON into one convenient front end. XQuery, a W3C Recommendation since 2007, lets you query XML or HTML files as if they were database servers, process the extracted data as desired, and save data to other files. As shown in the XQuery tutorial [2], XQuery-capable software can complete requests like finding all the CDs in an online catalog that cost less than $10, sorted by release date.

Xidel also fully supports the other W3C Recommendations, XPath [3] and the data-interchange language JavaScript Object Notation (JSON) [4]. XPath defines both a syntax for identifying all the elements of an XML document and a library of standard functions that make it easy to navigate through such elements and extract them. JSON data structures represent any kind of data as objects made of unordered sets of name/value pairs (I'll show some examples of this later on in this article).

Installation

You can download Xidel from the website [1] with just a few clicks. Xidel offers a choice between a binary package in DEB format and a ZIP archive that contains just five files: a digital certificate, the changelog, an exhaustive README file that explains in detail how Xidel works, the executable program, and its installer. The installer (Listing 1) should be run with administrator privileges. At 11 lines, the installer could hardly be simpler.

Listing 1

Installation Script

01 #!/bin/bash
02 PREFIX=$1
03 sourceprefix=
04 if [[ -d programs/internet/xidel/ ]]; then sourceprefix=programs/internet/xidel/; else sourceprefix=./;  fi
05 mkdir -p $PREFIX/usr/bin
06
07 install -v $sourceprefix/xidel $PREFIX/usr/bin
08 if [[ -f $sourceprefix/meta/cacert.pem ]]; then
09 mkdir -p $PREFIX/usr/share/xidel
10 install -v $sourceprefix/meta/cacert.pem $PREFIX/usr/share/xidel/;
11 fi

Listing 1 sets the installation $PREFIX to the directory passed as the first argument (line 2). On my computer, I chose the root folder (/), but you may prefer /opt or a similar location. Next, the script uses the install program to copy the xidel executable and its certificate into the usr/bin and usr/share/xidel subdirectories of $PREFIX, respectively.
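Assuming the archive was unpacked into the current directory, a typical invocation might look like the following sketch (the installer's file name, install.sh here, is an assumption; use the name shipped in the ZIP archive):

```shell
# Hypothetical invocation of the installer from Listing 1;
# the script name "install.sh" is an assumption.
sudo ./install.sh /opt

# If you chose a prefix other than /, make sure its bin
# directory is in your PATH:
export PATH="/opt/usr/bin:$PATH"
xidel --version   # verify that the binary is found
```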

When I tried to launch the program after running the installer, I discovered that Xidel needs the developer versions of libopenssl and libcrypto (I couldn't find this problem documented at the time of writing). However, both libraries are available as native packages in the standard repositories of most distributions (e.g., libssl-dev on Debian derivatives, and openssl-devel on Fedora-based systems), so installing them takes a matter of minutes.

Main Features

Xidel can interact with websites if it has the proper data and instructions. It can log into websites on your behalf to perform tasks like updating personal information, submitting forms, or downloading private messages. Among other things, Xidel can reach websites using proxies, manage cookies, and pause between connections to prevent overloading servers and subsequently being banned. However, I do not cover these specific Xidel features for one simple reason: Websites change all the time, so any specific examples would be completely obsolete by the time you read this article. If you want to know how Xidel can, for example, handle your Reddit notifications, I recommend first checking the latest examples on the Xidel website and then, if necessary, asking for support on the Xidel mailing list (which I did while writing this article).

As far as automatic data processing is concerned, Xidel reads and parses standard input or plain text files in JSON, XML, and HTML formats. After processing their content according to your instructions, Xidel can output the result in the same formats, as well as plain text or, as I will show later, shell variables. In addition, you can define the output separator between multiple items and create custom headers and footers for your data reports.
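As a sketch of those report-formatting options, the following call combines a custom header, footer, and item separator in one report (the URL is only a placeholder; adjust the extract expressions to the page you process):

```shell
# Header and footer wrap the whole report; the separator is
# printed between the extracted items.
xidel -s https://www.linux-magazine.com \
      -e '//title' -e '//h1' \
      --output-header '--- report start ---' \
      --output-separator ' | ' \
      --output-footer '--- report end ---'
```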

Xidel's two main modes, extract and follow, are often used together. In a nutshell, extract mode extracts and processes data from the current document; on its own, it is all you need to process the data inside one or more local files or web pages. Follow mode starts where extract leaves off, following all the links found by previous operations in order to download and process the links' content.

Xidel can run multiple extract and follow actions in the same call, as long as you write them in the right order and never ask to follow data that was not directly passed to Xidel or found by previous extract operations.

In extract mode, Xidel can recognize and select document elements by their CSS selectors. If you want to process the extracted data, Xidel uses XPath 3.0 expressions. For more complex tasks, you can use the full XQuery standard to make Xidel run Turing-complete scripts, which StackExchange describes as "any algorithm you could think of, no matter how complex" [5].
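For instance, instead of an XPath expression, the extract string can call Xidel's css() helper function to select elements with a plain CSS selector (the URL is just an example):

```shell
# Select the page title with a CSS selector instead of XPath;
# css() is Xidel's helper function for CSS selectors.
xidel -s https://www.linux-magazine.com -e 'css("title")'
```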

However, when you need to extract multiple pieces of data at once, repeatedly, from specific sections of pages with a fixed structure (e.g., the titles and links of the most viewed topics in a forum), I recommend pattern matching, which I will discuss later.

Syntax-wise, as you will see in the examples I provide later, Xidel extract commands are one-liners that first pass to Xidel the file it should process and then, with the --extract= or -e option, a string that contains the actual operations to perform on the given document. When that string becomes so long that it's difficult to edit it on the command line, or you want to save it, you can write it to a file and pass the file to Xidel with the --extract-file option.
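The following sketch shows the idea: the query is saved to a file and then passed with --extract-file (the file name, query.xq, is arbitrary, and the URL is only a placeholder):

```shell
# Save a long query to a file instead of typing it inline...
cat > query.xq <<'EOF'
//title
EOF

# ...and pass the file to Xidel with --extract-file:
xidel -s https://www.linux-magazine.com --extract-file query.xq
```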

The option for the follow mode is --follow= or -f. As with extract, this option gives Xidel the expression that describes which element or sequence of elements should be followed. There are many other options for the follow mode, but with one exception they are almost all mirror versions of the extract options (e.g., you can save your commands in a file and pass it to Xidel with --follow-file). The exception, --follow-level, specifies the maximum recursion level when following pages from other pages. Set this carefully, because its default value is 99,999!
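As a hedged sketch, this is how recursion can be limited to a single level while following links (the URL is only a placeholder):

```shell
# Follow links only one level deep instead of the default
# 99,999, then print the title of each followed page.
xidel -s https://www.linux-magazine.com \
      --follow //a --extract //title --follow-level 1
```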

Other Options and Variables

In addition to the options that pass or configure the actual commands, --silent suppresses notifications, --verbose shows all notifications, and --debug-arguments shows how command-line arguments were parsed and your queries were evaluated. For support, add --help to get a list of all the available command-line options and --usage to read all the same documentation provided in the README file.

Last but not least, I want to mention global variables. Depending on your request, Xidel provides the full input text in raw form inside $raw or already parsed in JSON format inside $json. If you tell Xidel to download a web page, its full address will be inside $url, and its headers, host, and path will be inside $headers, $host, and $path, respectively.
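These variables can be printed like any other extract expression, as in this quick sketch (placeholder URL):

```shell
# Print two of the global variables Xidel fills in for a
# downloaded page; the URL is only a placeholder.
xidel -s https://www.linux-magazine.com -e '$url' -e '$host'
```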

Practical Examples

Now I'll put Xidel to work and show you some practical examples. The simplest possible thing you can ask Xidel is for the title of a web page (auxiliary notifications have been omitted for readability in all examples):

#> xidel https://www.linux-magazine.com --extract //title
Linux Magazine

which yields the expected result. Notice that to indicate an element, in this case the title, you prefix it with a double slash (//), the XPath shorthand that matches that element anywhere in the document. You can also give Xidel multiple commands in the same call:

#> xidel https://www.linux-magazine.com -e "//title" https://edition.cnn.com --extract //title
Linux Magazine
Breaking News, Latest News and Videos | CNN

In this example, I deliberately used both forms of the extract option (-e and --extract).

Listing 2 shows the partial output of a simple follow command.

Listing 2

follow Output

#> xidel https://www.linux-magazine.com --follow //a --extract //title > all-titles
...
Sparkhaus Shop - Linux Magazines & Online-Services
Administration » Linux Magazine
Desktop » Linux Magazine
Web%20Development » Linux Magazine
...

If you compare the four output lines in Listing 2 with the links highlighted in Figure 1, you can see that the command instructed Xidel to download all the pages linked from www.linux-magazine.com and then extract and print those pages' titles.

Figure 1: The titles output in Listing 2 are from the linked pages highlighted here on the Linux Magazine website.

Because XQuery is Turing complete, Xidel can use XQuery to create documents from scratch, without looking anywhere for data, as shown in the following command (also shown on the left side of Figure 2):

#> xidel --xquery '<table>{for $i in 0 to 1000 return <tr><td>{$i}</td><td>{if ($i mod 2 = 0) then "Linux Magazine" else "is great"}</td></tr>}</table>' --output-format xml > test.html
Figure 2: With Xidel, generating HTML tables or other documents is quick and simple.

This command outputs the test.html web page, which is the HTML table displayed by Firefox on the right side of Figure 2.

Figure 3 and Listing 3 show the --output-format option, which lets Xidel hand its results over to other Linux tools. Line 1 of Listing 3, whose output, as well as links to its sources, is shown in Figure 3, sets the Bash variable $title to the page's title and makes $links an array of all the links on the same page. The --output-format bash option tells Xidel to print its findings as Bash variable assignments, which the surrounding eval statement then loads into the shell; the specifics are declared in the two -e options.

Listing 3

Setting Bash Environment Variables

01 #> eval "$(/home/marco/testing/xidel https://www.linux-magazine.com -e 'title:=//title' -e 'links:=//a/@href' --output-format bash)"
02
03 #>  echo $title
04 Linux Magazine
05
06 #>  printf '%s\n' "${links[@]}" | grep  '^/Online/'
07
08 /Online/News
09 /Online/Features
10 /Online/Blogs
11 /Online/White-Papers
12 /Online/News/Zorin-OS-16.3-is-Now-Available
13 /Online/News/SUSECON-2023
14 /Online/News/Mageia-9-RC1-Now-Available-for-Download
15 /Online/News/Linux-Mint-21.2-Now-Available-for-Installation
16 ...
Figure 3: After extracting data from a web page, Xidel can send the data everywhere, including a Linux terminal environment.

The first -e option (note its := syntax) copies the title of the given web page (//title) into a shell variable, which is called title for clarity, but as far as the shell is concerned, it could have any other name. The second -e option does the same thing, using //a/@href to signify that all the href values (i.e., the actual URLs) of all the HTML anchor tags (a) that define HTML hyperlinks should be copied inside links. Because there are many such tags, links will automatically become an array instead of a single cell. This is why, in the printf statement in line 6, links is referenced with curly and square brackets.

Cool, huh? But enough of HTML and XML; I'll turn to some JSON examples. In a previous article in Linux Magazine [6], I wrote about the Shaarli bookmark manager (Figure 4), which saves its bookmarks as one JSON array with the structure shown in Listing 4 (heavily edited for clarity).

Listing 4

Shaarli Bookmarks in JSON Format

#> jq '.' shaarli.json | more
[
  ... other records...
  {
    "id": 180,
    "url": "URL of this bookmark",
    "title": "Why AI Will Save The World",
    other name/value pairs...
    ...
  },
  {
    "id": 178,
    "url": "URL of this other bookmark",
    "title": "Let Them Eat Solar Panels",
    other names/values of this other bookmark...
    ...
  }
  ... many other records...
]
Figure 4: The Shaarli bookmark manager shows some of the bookmarks retrieved by Xidel.

Xidel reads the shaarli.json file that stores the JSON array and can fetch any name/value pair of any record, as long as I know the record's position in the array (see Listing 5).

Listing 5

Extracting a Specific Value

#> xidel shaarli.json -e '$json(4).title' -e '$json(8).title'
Why AI Will Save The World
Open source licenses need to evolve to deal with AI

Because XQuery is Turing complete and Xidel automatically loads every JSON file you give it into an array called $json, I can use the one-liner loop in Listing 6 to scan the entire array and then filter, with the egrep shell command, all my bookmark titles about artificial intelligence (AI).

Listing 6

Extracting the Bookmark Titles

#> xidel shaarli.json  -e 'for $t in $json/title return string-join(("TITLE", $t), " ==>  ")' | egrep 'AI|ntelligence'
TITLE ==>  Why AI Will Save The World
TITLE ==>  AI Is Coming For Your Children
TITLE ==>  AI has poisoned its own well
TITLE ==>  This AI Boom Will Bust
...

The for command grabs all the title values inside $json and loads each of them, one at a time, into the auxiliary variable $t. Then, the XQuery string-join function attaches the current title to the constant string TITLE, using the string passed as the last argument as the connector.

If you want to extract more than one value per bookmark (e.g., both its ID number and title), you can use the concat function, which unsurprisingly concatenates all the arguments it gets:

#> xidel -s shaarli.json -e '$json()/concat("ARTICLE|",id,"|",title)'

Alternatively, you can use the extended string syntax (note the backticks!) as follows:

#> xidel -s shaarli.json -e '$json()/`ARTICLE|{id}|{title}`'

Both commands will produce the same output, part of which is visible in Listing 7. What is really important in both cases is the $json() notation, where the empty parentheses indicate the entire array, not just one of its elements. That's what makes Xidel extract and format, as shown in the second part of the command, the ID and title values of every bookmark.

Listing 7

A Pipe-Separated, Plain Text Excerpt

ARTICLE|102|EU passes landmark AI Act to rein in high-risk tech
ARTICLE|134|AI has poisoned its own well
ARTICLE|135|Study says AI data contaminates vital human input
ARTICLE|176|Open source licenses need to evolve to deal with AI
ARTICLE|178|AI Is Coming For Your Children
ARTICLE|180|Why AI Will Save The World

Finally, if I want the id column in Listing 7 to always be four characters wide (this works because I have fewer than 10,000 bookmarks), I can tell Xidel to always pad the id value with enough white spaces – even for ID values smaller than 1,000 – by replacing it with the expression

substring("    ",1,4 - string-length(id))||id

which takes from a string of four spaces just enough leading characters to pad the current value of id to four characters, and then appends id itself.
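Combined with the extended string syntax shown earlier, the full command might look like this sketch (it assumes the same shaarli.json file):

```shell
# Pad each id to four characters before printing it; note the
# backticks of the extended string syntax shown earlier.
xidel -s shaarli.json \
      -e '$json()/`ARTICLE|{substring("    ",1,4 - string-length(id))||id}|{title}`'
```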

As far as I am concerned, the capability illustrated in these last JSON examples shows Xidel's real power, and my main reason for using it. Formats like the one shown in Listing 7 may be too limited for sophisticated uses, but it would be very hard to find a better compromise between immediate readability, ease of conversion to any other format, and (perhaps most important to me) long-term guaranteed usability, regardless of the software available.

Patterns

Xidel's support for patterns can be very useful if you need to periodically extract dynamic data from web pages with a complex but constant structure, such as an ever-changing Top 10 list on a web forum. Although a thorough presentation of Xidel patterns is outside the scope of this article, I want to briefly describe them, because they may encourage others to try this program.

Listing 8 shows a snippet taken from the Xidel documentation of a pattern file that Xidel can use to fetch titles and links of the "recommended videos" listed on a YouTube page.

Listing 8

Fetching with Patterns

<li class="video-list-item">
  <!-- skipped -->
  Idras Sunhawk Lyrics
<li class="video-list-item">
  <a>
    <span class="title">{$title:=.}</span>

The {$title:=.} marker in the last line of Listing 8 uses XPath syntax to tell Xidel that every time it finds that particular sequence of elements – a list item (li) with the CSS class video-list-item, followed by an anchor tag (<a>), followed by a span element whose class is title – the value of that last element is data that should be saved in a variable called $title.

Visually, Xidel pattern files look similar to Bash here documents or Perl templates, because in all cases you have a fixed grid, or "mask," of text, in which the elements of interest occupy a fixed place. The difference is that Bash and Perl use those tools to show where already existing variables should be placed, or written, whereas Xidel patterns do just the opposite. At their core, Xidel patterns work like regular expressions too long to fit on one line, showing where to read the values that should be saved, as well as which variables should store them.
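Patterns do not have to live in files; as a minimal illustration, the following sketch passes an inline pattern on the command line. It assumes Xidel's convention that an extract string starting with < is treated as a pattern, and that a trailing * makes the element match repeatedly; double-check both against the current Xidel documentation before relying on them (the URL is only a placeholder):

```shell
# Inline pattern sketch: collect the text of every link on the
# page into the variable $link.
xidel -s https://www.linux-magazine.com -e '<a>{$link := .}</a>*'
```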

Xidel patterns, however, are more powerful than regular expressions, with many options to control how they process what they find. They can even be generated automatically: A Greasemonkey script [7] can create Xidel patterns by just selecting the text to scrape on the corresponding web page.

Conclusion

While there are many other ways to scrape and convert XML, HTML, and JSON documents, Xidel is a multiplatform tool with almost no dependencies. It's fun, relatively easy to learn, and its mailing list is very responsive (I want to specifically thank user Reino for explaining how to process JSON arrays with Xidel).

Above all, Xidel is a simple package that can grab and reorganize data in several standard formats, from very different sources, in a very flexible, efficient way. You can even test Xidel online [8]. I recommend giving Xidel a try.

The Author

Marco Fioretti (http://mfioretti.substack.com) is a freelance author, trainer, and researcher based in Rome, Italy, who has been working with free/open source software since 1995, and on open digital standards since 2005. Marco also is a board member of the Free Knowledge Institute (http://freeknowledge.eu).