Build your own crawlers
Spider, Spider
Scrapy is an open source framework written in Python that lets you build your own crawlers with minimal effort for professional results.
The crawler built in this article demonstrates the capabilities of version 1.0 of the Scrapy framework [1] running under Python 2.7 [2]. Scrapy is an open source framework for extracting data from websites. It recursively crawls through HTML documents and follows all the links it finds.
In the spirit of HTML5, the test created in this article is designed to reveal non-semantic markup on websites. The crawler counts the number of words used per page, as well as the number of characteristic tag groups (Table 1), saving the results along with the URL in a database.
Table 1
Definition of Stored Metrics
Metric | Meaning
---|---
keywords | Number of words in the <title> tag
words | Number of all words except keywords
relevancy | Frequency of keywords in the total number of words
tags | Total number of all tags
semantics | Total number of all semantic tags
links | Total number of all <a> tags with an href attribute
injections | Number of third-party resources
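The metrics in Table 1 can be approximated outside Scrapy with Python's standard library. The following sketch (the class name and the set of tags treated as "semantic" are my own assumptions, not taken from the article's code) counts a subset of the metrics in a sample HTML string:

```python
from html.parser import HTMLParser

# Assumption: which tags count as "semantic" is not specified in the article.
SEMANTIC_TAGS = {'article', 'aside', 'figure', 'footer',
                 'header', 'main', 'nav', 'section'}

class MetricsParser(HTMLParser):
    """Collects a subset of the Table 1 metrics from an HTML document."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.keywords = 0   # words inside <title>
        self.words = 0      # words outside <title>
        self.tags = 0       # total number of tags
        self.semantics = 0  # semantic tags
        self.links = 0      # <a> tags with an href attribute

    def handle_starttag(self, tag, attrs):
        self.tags += 1
        if tag in SEMANTIC_TAGS:
            self.semantics += 1
        if tag == 'a' and any(name == 'href' for name, _ in attrs):
            self.links += 1
        if tag == 'title':
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        # Simplification: all text nodes are counted, including scripts.
        n = len(data.split())
        if self.in_title:
            self.keywords += n
        else:
            self.words += n

# Made-up sample document for illustration
html = ('<html><head><title>Linux Magazin</title></head>'
        '<body><nav><a href="/news">News</a></nav>'
        '<p>Scrapy crawls the web</p></body></html>')
p = MetricsParser()
p.feed(html)
relevancy = p.keywords / (p.keywords + p.words)
```

A real crawler would feed each downloaded page through such a parser and store the counters together with the URL.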
To install the required packages, I used the Debian 8 Apt package manager:
apt-get install python-pip libxslt1-dev python-dev python-lxml
The packages include the Python package manager (Pip), the libxslt library along with its header files, the Python header files, and the Python bindings for libxml and libxslt. Because Debian 8 comes with Python 2.7 and libxml pre-installed, you can install Scrapy as follows:
pip install scrapy
Unlike Apt, Pip installs the latest Scrapy version for Python 2.7 from the Python Package Index [3].
Test Run
To begin, open an interactive session in the Scrapy shell by entering scrapy shell (Figure 1). Next, send a command to the Scrapy engine to tell the on-board downloader to read the German Linux-Magazin homepage through an HTTP request and transfer the results to the response object (Figure 2):
fetch('http://www.linux-magazin.de')
Figure 3 demonstrates in detail how the components of the Scrapy architecture work together. The illustration makes it clear that the engine does not talk directly to the downloader but first passes the HTTP request to the scheduler (Figure 3, top). The downloader middleware (Figure 3, center right) modifies the HTTP request before it is sent. CookiesMiddleware, which is enabled by default, stores the cookies from the queried domain, whereas RobotsTxtMiddleware suppresses the retrieval of documents blocked for crawlers by the robots.txt file [5] on the web server.
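Both behaviors can be toggled in a project's settings file; a minimal sketch, assuming the standard Scrapy setting names that control these two middleware components:

```python
# settings.py fragment (sketch; values shown here are illustrative defaults)
COOKIES_ENABLED = True   # CookiesMiddleware: store cookies per queried domain
ROBOTSTXT_OBEY = True    # RobotsTxtMiddleware: honor the server's robots.txt
```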
Tracker
Scrapy evaluates the document components by interactively querying and investigating them via the response object, as shown in Figure 2. The selection is made either with the help of CSS selectors [6], as in jQuery, or with XPath expressions [7], as in XSLT. For example, first enter the command,
response.xpath('//title/text()').extract()
as shown in Figure 2 to call the xpath() method. The //title subexpression first selects all <title> tags from the HTML document, and /text() selects the text nodes they contain. The extract() method transfers the result set to a Python list:
[u'Home \xbb Linux-Magazin']
Using the expression
len(response.xpath('//a/@href').extract())
you can extract the values of the href attributes of all <a> tags into a list, whose length the Python len() function then determines; in this case, there are 215 links (see Figure 2).
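The logic behind both shell expressions can be reproduced offline with Python's standard library, which is handy for experimenting without a network connection. The sample HTML below is made up; Scrapy's response object is not involved:

```python
import xml.etree.ElementTree as ET

# Made-up stand-in for a downloaded page (must be well-formed for ElementTree)
html = ('<html><head><title>Home \xbb Linux-Magazin</title></head>'
        '<body><a href="/1">one</a><a href="/2">two</a>'
        '<a name="x">no href</a></body></html>')

root = ET.fromstring(html)

# Analogue of response.xpath('//title/text()').extract()
titles = [t.text for t in root.iter('title')]

# Analogue of len(response.xpath('//a/@href').extract())
hrefs = [a.get('href') for a in root.iter('a') if a.get('href') is not None]
```

Here titles holds one entry and len(hrefs) is 2, because the third <a> tag lacks an href attribute, mirroring how the @href step skips such tags.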
Getting Started
A sample application can be compiled with a little knowledge of Scrapy. The command
scrapy startproject mirror
lays the foundations by creating a matching directory structure (Listing 1). All further work takes place after changing to the mirror project directory. The empty files named __init__.py are purely technical in nature [8]. The spider and pipeline classes reside in subdirectories of the same name. Scrapy stores the results in the results directory and the associated reports in the reports directory.
Listing 1
mirror Sample Project
A few listings will be added to the skeleton project later. The mirror/utils.py file from the last line of Listing 1 stores the helper functions; Listing 2 shows the contents of this file.
Listing 2
mirror/utils.py
The global settings for the project are also written in Python and belong in mirror/settings.py (Listing 3). Scrapy itself creates the variables in the first three lines; their capitalization is reminiscent of constants in C, although true constants do not exist in Python.
Listing 3
mirror/settings.py
Line 1 stores the name Scrapy sends to the requested web server instead of the browser identifier in the header of the HTTP requests. Lines 2 and 3 are inherent in the Scrapy system and require no change. The variables that follow store application-specific constants.
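A settings file along these lines might look as follows; the concrete values and the application-specific names in the lower half are assumptions for illustration, not the actual contents of Listing 3:

```python
# mirror/settings.py (sketch; the real Listing 3 may differ)
USER_AGENT = 'mirror-bot'               # line 1: sent instead of a browser identifier
BOT_NAME = 'mirror'                     # lines 2 and 3: generated by Scrapy,
SPIDER_MODULES = ['mirror.spiders']     # no change required

# Application-specific constants (names are hypothetical)
RESULTS_DIR = 'results'
REPORTS_DIR = 'reports'
```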