Build your own crawlers
Transport Workers
Before sending the requested data into the item pipeline (Figure 3, left), Scrapy converts it to an item object. This conversion makes it easier to format and validate the application data. The item class of the sample application resides in mirror/items.py (Listing 4).
Listing 4
mirror/items.py
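The listing is not reproduced on this page; what follows is a minimal sketch reconstructed from the description below. The field names beyond those discussed in the text and the input_processor wiring are assumptions, and Split is defined first here so that the reference resolves (the line numbers cited in the text refer to the original listing).

# Hedged reconstruction of mirror/items.py.
from scrapy import Item, Field

class Split:
    # Callable processor with the predefined __call__() signature.
    def __call__(self, values):
        # Break every string in values into individual words and
        # return them as one flat list.
        return [word for value in values for word in value.split()]

class Attributes(Item):
    url = Field()
    title = Field()
    words = Field(input_processor=Split())  # wiring assumed
    links = Field()
    tags = Field()
    media = Field()
    semantic = Field()
    inject = Field()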
Line 1 explicitly imports the required base classes Item and Field from the Scrapy package. Line 3 declares the item class Attributes, naming its parent class Item in the parentheses. The final colon introduces a block to which all following lines with the same indentation belong.
Scrapy does not define the application data directly in the attributes but stores it in objects of type Field (lines 4 to 11). The constructor of Field comprises, as usual, the class name and a pair of round brackets. Thanks to the parameter passed in that call, the Field object first runs the data to be saved through the __call__() method of the Split object handed over to it. The Split class itself is implemented in lines 13-15. The signature of __call__() is predefined, as is the case for all methods that Scrapy invokes automatically. The method takes a list of character strings, and the list comprehension in line 15 breaks them down into individual words.
The first for loop iterates over all value items in the values list. The second loop uses split() to break each value string into words. With each pass, the word variable (to the right of the opening square bracket) adds one word to the resulting list.
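To illustrate, calling such a processor by hand on a hypothetical pair of strings flattens them into single words:

>>> Split()(['Lorem ipsum', 'dolor'])
['Lorem', 'ipsum', 'dolor']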
Industrious Spider
Within the Scrapy application, the aforementioned spiders run the show (Figure 3, center). They house the part of the application code that, much like the fetch() call in the interactive session from Figure 2, retrieves documents and evaluates the response objects. Listing 5 shows the spider residing in mirror/spiders/attr.py. In addition to the CrawlSpider shown here, Scrapy offers the generic spiders XMLFeedSpider and CSVFeedSpider.
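For comparison, the fetch-and-evaluate cycle can be reproduced by hand in the interactive session; the URL here is just a placeholder:

$ scrapy shell
>>> fetch('https://example.com')
>>> response.xpath('string(//title)').get()
'Example Domain'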
Listing 5
mirror/spiders/attr.py
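Again, the listing itself is not reproduced on this page; the following minimal sketch reconstructs it from the description below. The spider class name, the start URL, the domain, and the exact XPath strings are assumptions, and the line numbers cited in the text refer to the original listing.

# Hedged reconstruction of mirror/spiders/attr.py.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
from mirror.items import Attributes
from mirror.utils import join

class AttrSpider(CrawlSpider):
    name = 'attr'
    rules = [
        Rule(LinkExtractor(), callback='parse_page'),  # default route
    ]
    allowed_domains = ['example.com']      # placeholder, optional
    start_urls = ['https://example.com/']  # placeholder

    def parse_page(self, resp):
        # Collect values from the response into an Attributes item.
        ldr = ItemLoader(item=Attributes(), response=resp)
        ldr.add_value('url', resp.url)
        ldr.add_xpath('title', 'string(//title)')
        ldr.add_xpath('words', "//*[name()!='title']/text()")  # assumed form
        ldr.add_xpath('links', '//a/@href')                    # assumed form
        ldr.add_xpath('tags', '//*')                           # assumed form
        ldr.add_xpath('media', join(self.settings.get('MEDIA_TAGS')))
        ldr.add_xpath('semantic', join(self.settings.get('SEMANTIC_TAGS')))
        ldr.add_xpath('inject', join(self.settings.get('INJECT_TAGS')))
        return ldr.load_item()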
Lines 1-5 import the needed classes and functions. The rest of the listing defines the spider class, which is derived from CrawlSpider. A crawl launched with scrapy crawl attr selects, instantiates, and invokes the spider class by its name (i.e., the value of the name attribute in line 8). Starting in line 9, the object works its way recursively through the documents. Much like a router, the rules attribute stores rules that assign URLs to different callback functions. The first element of the list sets the default route by passing a LinkExtractor() object without a path to the Rule object; the callback keyword then designates parse_page() (line 13) as the callback function.
Objects of the CrawlSpider type first read all the page addresses one by one from the list in the start_urls attribute and hand the resulting response objects over to the callback functions. The allowed_domains statement in line 11 is optional; it restricts queries to the listed domains.
In line 13, parse_page() picks up the response object as its second argument and stores it in the resp variable. The next line instantiates the ldr variable, a container object of the ItemLoader type that can hold other objects. The loader starts by initializing the item object of the Attributes type from Listing 4 that is handed over to it.
Copy Shop
Lines 15-22 of Listing 5 copy the values from the response object to the attributes of the item object. Line 15 uses the add_value() method to assign the URL of the response object to an attribute of the same name in the item object. Lines 16 to 22 use add_xpath() to copy document components to the listed attributes based on the XPath expressions passed in.
Line 16 uses //title to extract all <title> tags, and the string() XPath function retrieves the respective text values. The add_*() functions manage their results as lists. Line 17 retrieves all the words from the HTML document in a similar way: the [name()!='title'] predicate picks only the tags not called <title> from the selection of all tags (//*).
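The effect of this filter is easy to check against a small, made-up HTML snippet with Scrapy's Selector class (the exact expression used in the listing may differ slightly):

from scrapy.selector import Selector

snippet = '<html><head><title>t</title></head><body><p>hello world</p></body></html>'
sel = Selector(text=snippet)
print(sel.xpath('string(//title)').get())                 # 't'
print(sel.xpath("//*[name()!='title']/text()").getall())  # ['hello world']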
As shown in Figure 2, lines 18 and 19 copy all links and tags to the item object. The next three lines evaluate the MEDIA_TAGS, SEMANTIC_TAGS, and INJECT_TAGS lists from Listing 3. The join() helper function from mirror/utils.py transforms these lists into XPath expressions. To be more precise, join() formats a given list of tags such as ['a', 'b'] as //a | //b; the pipe symbol represents the OR operator in XPath. The spider uses the get() method of the settings object to look up the three lists in attributes of the same name.
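The helper itself is not shown here, but the behavior just described boils down to something like this sketch (the actual implementation in mirror/utils.py may differ):

# Plausible sketch of join() from mirror/utils.py.
def join(tags):
    # Format ['a', 'b'] as the XPath union expression '//a | //b'.
    return ' | '.join('//' + tag for tag in tags)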
Line 23 completes the initialization of the item object by calling the load_item() method; empty attributes would cause an error. The return statement passes the item object into the item pipeline.