Build your own crawlers

Transport Workers

Before sending the requested data into the item pipeline (Figure 3, left), Scrapy converts it to an item object. This conversion makes it easier to format and validate the application data. The item class of the sample application resides in mirror/items.py (Listing 4).

Listing 4

mirror/items.py

 

Line 1 explicitly imports the required base classes Item and Field from the Scrapy package. Line 3 declares the item class Attributes; the parentheses point to the parent class Item. The final colon introduces a block to which all following lines with the same indentation belong.

Scrapy does not define the application data directly in the attributes but stores it in objects of type Field (lines 4 to 11). As usual, the Field constructor call consists of the class name followed by a pair of parentheses.

Thanks to the parameter passed in this call, the Field object first runs the data to be saved through the __call__() method of the Split object handed to it. Lines 13-15 implement this class. The signature of __call__() is predefined, as is the case for all methods that Scrapy invokes automatically. The method takes a list of character strings, and the list comprehension in line 15 breaks it down into individual words.

The first for loop iterates over the items in the values list; the second loop iterates over the words that split() extracts from each value. With each pass, the word variable (to the right of the opening square bracket) adds one word to the resulting list.
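Because Listing 4 itself is not reproduced in this excerpt, the following is only a rough sketch of what mirror/items.py might look like based on the description above. The field names and the keyword used to attach the Split processor are assumptions, and Split is placed before Attributes so the snippet runs on its own.

from scrapy import Item, Field

class Split:
    # Scrapy hands the processor a list of strings and expects a list back;
    # the list comprehension splits every value into individual words.
    def __call__(self, values):
        return [word for value in values for word in value.split()]

class Attributes(Item):
    url = Field()
    title = Field()
    # Attaching the processor via the field metadata is an assumption;
    # Scrapy's ItemLoader honors the output_processor key.
    words = Field(output_processor=Split())
    links = Field()
    tags = Field()
    media = Field()
    semantic = Field()
    inject = Field()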

Industrious Spider

Within the Scrapy application, the aforementioned spiders run the show (Figure 3, center). They hold the part of the application code that, as shown in the interactive session in Figure 2, calls the fetch() function and evaluates the response objects. Listing 5 shows the spider waiting in the sample application's mirror/spiders/attr.py file. In addition to the CrawlSpider shown here, Scrapy offers the generic spiders XMLFeedSpider and CSVFeedSpider.

Listing 5

mirror/spiders/attr.py

 

Lines 1-5 import the needed classes and functions. The rest of the listing defines the spider class, which is derived from CrawlSpider. The crawler launched with scrapy crawl attr selects, instantiates, and invokes the spider class by its name (i.e., the value of the name attribute in line 8). Starting from line 9, the object works its way recursively through the documents. The rules attribute stores rules that, much like a router, assign the URLs to different callback functions. The first element of the list sets up the default route by passing a LinkExtractor() object without a path into the Rule object; the callback keyword then defines parse_page() as the callback function (line 13).

Before continuing, objects of the CrawlSpider type first read the page addresses one by one from the list in the start_urls attribute and hand the resulting response objects over to the callback functions. The allowed_domains statement in line 11 is optional; it restricts queries to the listed domains.
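Based on this description, the head of the spider class might look like the following sketch; the class name, domain, and start URL are placeholder assumptions, and the callback body is filled in after the "Copy Shop" section below.

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class AttrSpider(CrawlSpider):
    # scrapy crawl attr finds the spider through this name
    name = 'attr'
    # optional: restrict the crawl to these domains
    allowed_domains = ['www.example.com']
    # the crawl starts with these addresses
    start_urls = ['http://www.example.com/']
    # default route: LinkExtractor() without a path matches every link,
    # and parse_page() handles the responses
    rules = [Rule(LinkExtractor(), callback='parse_page')]

    def parse_page(self, resp):
        # the body is sketched after the "Copy Shop" section below
        pass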

In line 13, parse_page() picks up the response object as its second argument and stores it in the resp variable. The next line instantiates the ldr variable, a container object of ItemLoader type that can hold other objects. It first takes care of initializing the item object of type Attributes from Listing 4 that is handed over to it.

Copy Shop

Lines 15-22 of Listing 5 copy the values from the response object to the attributes of the item object. Line 15 uses the add_value() method to assign the URL of the response object to an attribute with the same name in the item object. Lines 16 to 22 use add_xpath() to copy document components to the listed attributes based on the XPath expressions passed in.

Line 16 uses //title to extract all <title> tags, and the string() XPath function retrieves the respective text values. The add_ functions manage their results as lists. Line 17 retrieves all the words from the HTML document in a similar way. The [name()!='title'] expression chooses only the tags not called <title> from the selection of all tags (//*).

As shown in Figure 2, lines 18 and 19 copy all links and tags to the item object. The next three lines evaluate the MEDIA_TAGS, SEMANTIC_TAGS, and INJECT_TAGS lists from Listing 3. The join() helper function from mirror/utils.py transforms these into XPath expressions; to be more precise, join() formats a given list of tags ['a', 'b'] as //a | //b. The pipe symbol is XPath's union operator, which acts like an OR here. The spider uses the get() method of the settings object to retrieve the three lists under keys of the same name.
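The join() helper from mirror/utils.py is not shown in this excerpt; a minimal version that behaves as described could look like this:

def join(tags):
    # turn a list of tag names into an XPath union expression,
    # for example ['a', 'b'] -> '//a | //b'
    return ' | '.join('//' + tag for tag in tags)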

Line 23 ends the initialization of the item object by calling the load_item() method. Empty attributes would cause an error. The return statement passes the item object into the item pipeline.
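Putting these steps together, the parse_page() callback, added to the spider skeleton sketched earlier, might look roughly like this; the exact XPath expressions and settings keys are assumptions based on the description above.

from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
from scrapy.spiders import CrawlSpider, Rule

from mirror.items import Attributes
from mirror.utils import join

class AttrSpider(CrawlSpider):
    name = 'attr'
    start_urls = ['http://www.example.com/']  # placeholder
    rules = [Rule(LinkExtractor(), callback='parse_page')]

    def parse_page(self, resp):
        # wrap an empty Attributes item in an ItemLoader
        ldr = ItemLoader(item=Attributes(), response=resp)
        # copy the URL and the document parts selected by XPath
        ldr.add_value('url', resp.url)
        ldr.add_xpath('title', 'string(//title)')
        ldr.add_xpath('words', "//*[name()!='title']/text()")
        ldr.add_xpath('links', '//a/@href')
        ldr.add_xpath('tags', '//*')
        # the three tag lists come from the settings (Listing 3)
        ldr.add_xpath('media', join(self.settings.get('MEDIA_TAGS')))
        ldr.add_xpath('semantic', join(self.settings.get('SEMANTIC_TAGS')))
        ldr.add_xpath('inject', join(self.settings.get('INJECT_TAGS')))
        # load_item() finishes the item; returning it sends it into the pipeline
        return ldr.load_item()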


