Build your own crawlers

Spider, Spider

Lead Image © mtkang, 123RF.com

Article from Issue 191/2016

Scrapy is an open source framework written in Python that lets you build your own crawlers with minimal effort for professional results.

This article's crawler demonstrates the capabilities of version 1.0 of the Scrapy framework [1] running under Python 2.7 [2]. Scrapy is an open source framework for extracting data from websites. It recursively crawls through HTML documents and follows all the links it finds.

In the spirit of HTML5, the test created in this article is designed to reveal non-semantic markup on websites. The crawler counts the number of words used per page, as well as the number of characteristic tag groups (Table 1), saving the results along with the URL in a database.

Table 1

Definition of Stored Metrics

Metric       Meaning
keywords     Number of words in the <title> tag
words        Number of all words except keywords
relevancy    Frequency of keywords in the total number of words
tags         Total number of all tags
semantics    Total number of all semantic tags
links        Total number of all <a> tags with an href attribute
injections   Number of third-party resources

To install the required packages, I used the Debian 8 Apt package manager:

apt-get install python-pip libxslt1-dev python-dev python-lxml

The packages include the Python package manager (Pip), the libxslt library along with the header files, the Python header files, and the Python bindings for libxml and libxslt. Because Debian 8 comes with Python 2.7 and libxml pre-installed, you can install Scrapy as follows:

pip install scrapy

Unlike Apt, Pip installs the latest Scrapy version for Python 2.7 from the Python Package Index [3].

Test Run

To begin, open an interactive session in the Scrapy shell by entering scrapy shell (Figure 1). Next, send a command to the Scrapy engine to tell the on-board downloader to read the German Linux-Magazin homepage through an HTTP request and transfer the results to the response object (Figure 2):

Figure 1: In the Scrapy shell, you can test commands interactively.
Figure 2: The on-board downloader bundles the Linux-Magazin site into a response object.
fetch('http://www.linux-magazin.de')

Figure 3 demonstrates in detail how the components of the Scrapy architecture work together. This illustration makes it clear that the engine does not talk directly to the downloaders but first passes the HTTP request to the scheduler (Figure 3, top). The downloader middleware (Figure 3, center right) modifies the HTTP request before deployment. CookiesMiddleware, which is enabled by default, stores the cookies from the queried domain, whereas RobotsTxtMiddleware suppresses the retrieval of documents blocked for crawlers by the robots.txt [5] file on the web server.

Figure 3: The Scrapy engine delegates tasks to different components, like the spider, the item pipelines, and middleware [4]. Twisted, an event-driven network framework, works in the background.
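Which downloader middleware components run is itself controlled in settings.py. A minimal sketch, assuming the Scrapy 1.0 module paths and default priority values (robots.txt handling additionally requires the ROBOTSTXT_OBEY switch), might look like this:

```python
# settings.py fragment (illustrative); paths and priorities
# reflect the Scrapy 1.0 defaults
ROBOTSTXT_OBEY = True  # let RobotsTxtMiddleware honor robots.txt

DOWNLOADER_MIDDLEWARES = {
    # Lower numbers run closer to the engine
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
}
```

Setting a component's value to None instead of a number disables it.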

Tracker

Scrapy evaluates the document components by interactively querying and investigating them via the response object, as shown in Figure 2. The selection is made either with the help of CSS selectors [6], as in jQuery, or with XPath expressions [7], as in XSLT. For example, first enter the command,

response.xpath('//title/text()').extract()

as shown in Figure 2 to call the xpath() method. The //title subexpression selects all <title> tags from the HTML document, and /text() selects the text nodes they contain. The extract() method transfers the result set to a Python list:

[u'Home \xbb Linux-Magazin']

Using the expression

len(response.xpath('//a/@href').extract())

you can extract the values of the href attributes of all <a> tags into a list. Its length is revealed by the Python len() function; in this case, there are 215 (see Figure 2).
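The same two selections can be approximated outside Scrapy with Python's standard library. This minimal sketch (the HTML snippet is invented for illustration and must be well-formed XML for ElementTree) mimics //title/text() and //a/@href:

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed stand-in for a real page (illustrative only)
html = ('<html><head><title>Home - Linux-Magazin</title></head>'
        '<body><a href="/news">News</a><a href="/archive">Archive</a>'
        '<a>no href</a></body></html>')

root = ET.fromstring(html)
title = root.find('.//title').text    # like //title/text()
links = root.findall('.//a[@href]')   # like //a/@href: only <a> with href
print(title)       # Home - Linux-Magazin
print(len(links))  # 2 -- the third <a> has no href attribute
```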

Getting Started

A sample application can be compiled with a little knowledge of Scrapy. The command

scrapy startproject mirror

lays the foundation by creating a matching directory structure (Listing 1). The user runs the application from within the mirror project directory. The empty files named __init__.py are purely technical in nature [8]. The spider and pipeline classes can be found in subdirectories of the same name. Scrapy stores the results in the results directory and the associated reports in the reports directory.

Listing 1

mirror Sample Project

 |- scrapy.cfg
 |- mirror:
   |- __init__.py
   |- items.py
   |- pipelines
     |- __init__.py
     |- filter.py
     |- normalize.py
     |- store.py
   |- reports
     |- attr.py
   |- results
     |- 2016032210001458637243.sqlite3
   |- settings.py
   |- spiders
     |- __init__.py
     |- attr.py
   |- utils.py

A few listings will be added to the skeleton project later. The mirror/utils.py file from the last line of Listing 1 stores the helper functions. Listing 2 shows the contents of this file.

Listing 2

mirror/utils.py

from urlparse import urlparse

# Return alist[key] if present, otherwise a default
def optvalue(alist, key, default=[]):
  if key in alist:
    return alist[key]
  return default

# Extract the domain part of a URL
def domain(url):
  return urlparse(url).netloc

# Build an XPath union expression from a list of tag names
def join(tags):
  return "|".join(['//'+tag for tag in tags])

# Frequency of the keywords kds among the words wds
def relevance(kds, wds):
  if len(kds) == 0 or len(wds) == 0:
    return 0
  return reduce(lambda acc, kw: float(wds.count(kw)) + acc, kds, 0)/len(kds + wds)

# True if the URL is absolute or protocol-relative
def is_absurl(url):
  return reduce(lambda acc, p: url.startswith(p) or acc, [u'http://', u'https://', u'//'], False)

The global settings for the project are also in Python format and belong in mirror/settings.py (Listing 3). Scrapy itself creates the variables in the first three lines; their capitalization is reminiscent of constants in C, although Python has no true constants.

Listing 3

mirror/settings.py

01 BOT_NAME = 'mirror'
02 SPIDER_MODULES = ['mirror.spiders']
03 NEWSPIDER_MODULE = 'mirror.spiders'
04 ITEM_PIPELINES = {
05     'mirror.pipelines.normalize.Words': 300,
06     'mirror.pipelines.filter.Injections': 400,
07     'mirror.pipelines.store.Attributes': 500
08 }
09 RESULTS = 'mirror/results/'
10 MEDIA_TAGS = ['video', 'audio', 'img', 'canvas']
11 INJECT_TAGS = ['script/@src', 'img/@src', 'video/@src', 'audio/@src', 'iframe/@src', 'embed/@src', 'link/@href']
12 SEMANTIC_TAGS = ['html', 'head', 'title', 'meta', 'link', 'body', 'header', 'footer', 'nav', 'article', 'aside', 'section', 'h1', 'h2', 'h3', 'h4', 'p', 'a', 'ul', 'ol', 'li', 'dl', 'dt', 'figure', 'table', 'th', 'tr', 'td', 'video', 'audio', 'form', 'input',  'label', 'button']

Line 1 stores the name that Scrapy sends to the queried web server in the HTTP request header instead of a browser identifier. Lines 2 and 3 are inherent to the Scrapy system and require no change. The variables that follow store application-specific constants.
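Tag lists such as SEMANTIC_TAGS feed the join() helper from Listing 2, which builds a single XPath union expression out of them. A hypothetical fragment (the tag list is shortened here, and the spider code itself is not part of this excerpt) illustrates the idea:

```python
# Hypothetical illustration: turning a tag list into an XPath union,
# as a spider callback might do with SEMANTIC_TAGS
SEMANTIC_TAGS = ['header', 'nav', 'article', 'section']  # shortened

def join(tags):  # same logic as mirror/utils.py
    return "|".join(['//' + tag for tag in tags])

xpath_expr = join(SEMANTIC_TAGS)
print(xpath_expr)  # //header|//nav|//article|//section
# In a spider callback, the metric would then be something like:
#   semantics = len(response.xpath(xpath_expr).extract())
```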
