Build your own crawlers
Transport Workers
Before sending the requested data into the item pipeline (Figure 3, left), Scrapy converts it to an item object. This conversion makes it easier to format and validate the application data. The item class of the sample application resides in mirror/items.py (Listing 4).
Listing 4
mirror/items.py
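The listing is not reproduced on this page; what follows is a minimal sketch reconstructed from the description below. The field names beyond those discussed in the text and the input_processor wiring are assumptions, and Split is defined first here so that the reference resolves (the line numbers cited in the text refer to the original listing).

# Hedged reconstruction of mirror/items.py.
from scrapy import Item, Field

class Split:
    # Callable processor with the predefined __call__() signature.
    def __call__(self, values):
        # Break every string in values into individual words and
        # return them as one flat list.
        return [word for value in values for word in value.split()]

class Attributes(Item):
    url = Field()
    title = Field()
    words = Field(input_processor=Split())  # wiring assumed
    links = Field()
    tags = Field()
    media = Field()
    semantic = Field()
    inject = Field()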
Line 1 explicitly imports the required base classes Item and Field from the Scrapy package. Line 3 declares the item class Attributes, naming its parent class Item in the parentheses. The final colon introduces a block to which all following lines with the same indentation belong.
Scrapy does not define the application data directly in the attributes but stores it in objects of type Field (lines 4 to 11). The constructor of Field comprises, as usual, the class name and a pair of round brackets. Thanks to the parameter passed in that call, the Field object first runs the data to be saved through the __call__() method of the Split object handed over to it. The Split class itself is implemented in lines 13-15. The signature of __call__() is predefined, as is the case for all methods that Scrapy invokes automatically. The method takes a list of character strings, and the list comprehension in line 15 breaks them down into individual words.
The first for loop iterates over all value items in the values list. The second loop uses split() to break each value string into words. With each pass, the word variable (to the right of the opening square bracket) adds one word to the resulting list.
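To illustrate, calling such a processor by hand on a hypothetical pair of strings flattens them into single words:

>>> Split()(['Lorem ipsum', 'dolor'])
['Lorem', 'ipsum', 'dolor']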
Industrious Spider
Within the Scrapy application, the aforementioned spiders run the show (Figure 3, center). They house the part of the application code that, much like the fetch() call in the interactive session from Figure 2, retrieves documents and evaluates the response objects. Listing 5 shows the spider residing in mirror/spiders/attr.py. In addition to the CrawlSpider shown here, Scrapy offers the generic spiders XMLFeedSpider and CSVFeedSpider.
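For comparison, the fetch-and-evaluate cycle can be reproduced by hand in the interactive session; the URL here is just a placeholder:

$ scrapy shell
>>> fetch('https://example.com')
>>> response.xpath('string(//title)').get()
'Example Domain'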
Listing 5
mirror/spiders/attr.py
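Again, the listing itself is not reproduced on this page; the following minimal sketch reconstructs it from the description below. The spider class name, the start URL, the domain, and the exact XPath strings are assumptions, and the line numbers cited in the text refer to the original listing.

# Hedged reconstruction of mirror/spiders/attr.py.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
from mirror.items import Attributes
from mirror.utils import join

class AttrSpider(CrawlSpider):
    name = 'attr'
    rules = [
        Rule(LinkExtractor(), callback='parse_page'),  # default route
    ]
    allowed_domains = ['example.com']      # placeholder, optional
    start_urls = ['https://example.com/']  # placeholder

    def parse_page(self, resp):
        # Collect values from the response into an Attributes item.
        ldr = ItemLoader(item=Attributes(), response=resp)
        ldr.add_value('url', resp.url)
        ldr.add_xpath('title', 'string(//title)')
        ldr.add_xpath('words', "//*[name()!='title']/text()")  # assumed form
        ldr.add_xpath('links', '//a/@href')                    # assumed form
        ldr.add_xpath('tags', '//*')                           # assumed form
        ldr.add_xpath('media', join(self.settings.get('MEDIA_TAGS')))
        ldr.add_xpath('semantic', join(self.settings.get('SEMANTIC_TAGS')))
        ldr.add_xpath('inject', join(self.settings.get('INJECT_TAGS')))
        return ldr.load_item()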
Lines 1-5 import the needed classes and functions. The rest of the listing defines the spider class, which is derived from CrawlSpider. A crawl launched with scrapy crawl attr selects, instantiates, and invokes the spider class by its name (i.e., the value of the name attribute in line 8). Starting in line 9, the object works its way recursively through the documents. Much like a router, the rules attribute stores rules that assign URLs to different callback functions. The first element of the list sets the default route by passing a LinkExtractor() object without a path to the Rule object; the callback keyword then designates parse_page() (line 13) as the callback function.
Objects of the CrawlSpider type first read all the page addresses one by one from the list in the start_urls attribute and hand the resulting response objects over to the callback functions. The allowed_domains statement in line 11 is optional; it restricts queries to the listed domains.
In line 13, parse_page() picks up the response object as its second argument and stores it in the resp variable. The next line instantiates the ldr variable, a container object of the ItemLoader type that can hold other objects. The loader starts by initializing the item object of the Attributes type from Listing 4 that is handed over to it.
Copy Shop
Lines 15-22 of Listing 5 copy the values from the response object to the attributes of the item object. Line 15 uses the add_value() method to assign the URL of the response object to an attribute of the same name in the item object. Lines 16 to 22 use add_xpath() to copy document components to the listed attributes based on the XPath expressions passed in.
Line 16 uses //title to extract all <title> tags, and the string() XPath function retrieves the respective text values. The add_*() functions manage their results as lists. Line 17 retrieves all the words from the HTML document in a similar way: the [name()!='title'] predicate picks only the tags not called <title> from the selection of all tags (//*).
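The effect of this filter is easy to check against a small, made-up HTML snippet with Scrapy's Selector class (the exact expression used in the listing may differ slightly):

from scrapy.selector import Selector

snippet = '<html><head><title>t</title></head><body><p>hello world</p></body></html>'
sel = Selector(text=snippet)
print(sel.xpath('string(//title)').get())                 # 't'
print(sel.xpath("//*[name()!='title']/text()").getall())  # ['hello world']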
As shown in Figure 2, lines 18 and 19 copy all links and tags to the item object. The next three lines evaluate the MEDIA_TAGS, SEMANTIC_TAGS, and INJECT_TAGS lists from Listing 3. The join() helper function from mirror/utils.py transforms these lists into XPath expressions. To be more precise, join() formats a given list of tags such as ['a', 'b'] as //a | //b; the pipe symbol represents the OR operator in XPath. The spider uses the get() method of the settings object to look up the three lists in attributes of the same name.
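The helper itself is not shown here, but the behavior just described boils down to something like this sketch (the actual implementation in mirror/utils.py may differ):

# Plausible sketch of join() from mirror/utils.py.
def join(tags):
    # Format ['a', 'b'] as the XPath union expression '//a | //b'.
    return ' | '.join('//' + tag for tag in tags)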
Line 23 completes the initialization of the item object by calling the load_item() method; empty attributes would cause an error. The return statement passes the item object into the item pipeline.