Build your own crawlers

Pneumatic Tube

The item pipeline continues to process the Item objects; the ITEM_PIPELINES variable in lines 4-8 of the sample application in Listing 3 configures the pipeline. It passes each Item object, one by one, to instances of the pipeline classes Words (priority 300), Injections (400), and Attributes (500), which modify or store the items and push them forward.
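A configuration of this kind could look roughly like the following settings excerpt. The module paths are assumptions derived from the listing names used in this article; the integers determine the order in which Scrapy runs the stages:

# Hypothetical excerpt from the project settings (cf. Listing 3, lines 4-8);
# the module paths are assumptions, not the article's exact code.
ITEM_PIPELINES = {
    'mirror.pipelines.normalize.Words': 300,    # normalize keywords and words
    'mirror.pipelines.filter.Injections': 400,  # filter out same-origin resources
    'mirror.pipelines.store.Attributes': 500,   # store the results in SQLite
}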

Listing 6 shows the code for the Words class, which checks and standardizes the words for the subsequent evaluation. Scrapy creates and wires up the pipeline objects and calls their process_item() method for each item object (line 2). The method expects the item object as its second argument and a reference to the calling spider as its third.

Listing 6

mirror/pipelines/normalize.py

 

The next two lines overwrite the values of the keywords and words attributes. In line 4, the filter() function fishes out of the item[key] list all entries for which the lambda function (lambda wd: wd.isalnum()) returns a true value, that is, all strings that consist purely of alphanumeric characters.

A second lambda function, passed to map(), then converts the words in the resulting set to lowercase. The return statement in the last line hands the item object over to the next pipeline stage (Listing 7), which considers the possible effect of foreign content on the current browser session.
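A minimal sketch of such a normalization stage, reconstructed from the description above rather than taken from the article's listing, might look like this:

class Words(object):
    def process_item(self, item, spider):
        # Keep only purely alphanumeric strings, then lowercase them.
        for key in ('keywords', 'words'):
            kept = filter(lambda wd: wd.isalnum(), item[key])
            item[key] = list(map(lambda wd: wd.lower(), kept))
        return item  # hand the item on to the next pipeline stage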

Listing 7

mirror/pipelines/filter.py

 

The Injections pipeline class from Listing 7 relies on the helper functions is_absurl() and domain() from mirror/utils.py to reduce the list of resources to those that pull in foreign content.

To do this, the process_item() method overwrites the attribute with the filtered list that the list comprehension in line 5 creates. The first for clause only reads the URL of the page; the second iterates over the tags to be injected. If the if clause determines that the attribute is an absolute URL pointing to a domain other than the current one, the comprehension keeps the resource. The return statement hands the item altered in this way over to the last link in the pipeline (Listing 8), which condenses the results for later evaluation and stores them in a database.
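The following sketch illustrates the idea; the field names url and injections, and the assumption that the page URL arrives as a one-element list, are guesses, while is_absurl() and domain() come from mirror/utils.py as described above:

from mirror.utils import is_absurl, domain

class Injections(object):
    def process_item(self, item, spider):
        # Keep only resources whose absolute URL points to a domain
        # other than that of the crawled page itself.
        item['injections'] = [
            src
            for url in item['url']          # page URL (assumed one-element list)
            for src in item['injections']   # URLs of the resources to be injected
            if is_absurl(src) and domain(src) != domain(url)
        ]
        return item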

Listing 8

mirror/pipelines/store.py

 

The Attributes pipeline class (Listing 8, line 6) evaluates the item object and stores the results in a SQLite database file [9]. The free, SQL-compatible database engine does not require a server process and has bindings for all common programming languages.

Listing 8 writes all data directly and synchronously into the database file. Line 1 imports the matching driver for Python, and the next three lines import the required functions from the standard modules os.path and time and from the project's mirror/utils.py.

As its second parameter, the __init__() constructor accepts the path under which Python opens the SQLite database. Scrapy uses the from_crawler() class method (lines 10-12) to instantiate the pipeline object. A look at the method shows that from_crawler() receives the crawler object in its parameter list, which provides access to the settings from Listing 3. First, it reads the value of the RESULTS variable (Listing 3, line 9); then, it passes that value to the constructor call in lines 7 and 8 of Listing 8. Finally, gmtime() and strftime(), in combination with a string formatting variant, generate a timestamp for the filename.

The Scrapy engine calls the open_spider() method (lines 14-17) exactly once, in the style of a callback function, when it starts the spider. The method creates a database connection and stores it in the conn attribute in line 15. Specifying isolation_level=None tells the driver to persist each SQL statement immediately in the database file. Line 16 creates and stores the database cursor object that runs the database operations, which also include the SQL command that creates the results table:

CREATE TABLE Attributes (url text PRIMARY KEY, keywords int, words int, relevancy int, tags int, semantics int, medias int, links int, injections int)
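Taken together, the setup part of this pipeline might look roughly like the following sketch. It is a reconstruction from the description above; the attribute names conn and cur, the timestamp format, and the way the RESULTS value is turned into a filename are assumptions.

import sqlite3
from time import gmtime, strftime

class Attributes(object):
    def __init__(self, path):
        # Append a timestamp to the filename taken from the settings.
        self.path = '%s-%s.db' % (path, strftime('%Y%m%d-%H%M%S', gmtime()))

    @classmethod
    def from_crawler(cls, crawler):
        # The crawler object provides access to the settings from Listing 3.
        return cls(crawler.settings.get('RESULTS'))

    def open_spider(self, spider):
        # isolation_level=None puts the connection into autocommit mode, so
        # each SQL statement is persisted immediately in the database file.
        self.conn = sqlite3.connect(self.path, isolation_level=None)
        self.cur = self.conn.cursor()
        self.cur.execute('CREATE TABLE Attributes (url text PRIMARY KEY, '
                         'keywords int, words int, relevancy int, tags int, '
                         'semantics int, medias int, links int, injections int)')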

The process_item() method from line 22 writes the values of the item object into the Attributes table, as interpreted in Table 2, using the SQL INSERT command. The question marks are placeholders that SQLite replaces with the values of the tuple that follows in the parentheses. The len() function repeatedly counts the length of the parsed lists. The helper function optvalue() replaces None values with empty lists; relevance() determines the incidence of all keywords in the remaining text of the website.
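The core of process_item() thus boils down to a parameterized INSERT. The following standalone sketch illustrates the idea; the item field names and the exact signatures of optvalue() and relevance() are assumptions:

from mirror.utils import optvalue, relevance  # helper functions described above

def store_attributes(cur, item):
    # The question marks are placeholders that SQLite fills with the
    # values of the tuple passed as the second argument.
    cur.execute(
        'INSERT INTO Attributes VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)',
        (item['url'][0],                               # page URL (assumed one-element list)
         len(optvalue(item['keywords'])),
         len(optvalue(item['words'])),
         relevance(item['keywords'], item['words']),   # keyword incidence in the text
         len(optvalue(item['tags'])),
         len(optvalue(item['semantics'])),
         len(optvalue(item['medias'])),
         len(optvalue(item['links'])),
         len(optvalue(item['injections']))))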

Table 2

Interpretation of Acquired Data

Factor         Computation                      Interpretation
Relevancy      relevancy                        Measure of the credibility of the title
Entropy        (words+semantics)/(words+tags)   Non-semantic tags such as div or span reduce the information content
Expressivity   semantics/tags                   Semantic tags improve the functional classification of document components
Richness       medias                           Media enrich the content
Reliability    links/words                      Links vouch for credibility
Mutability     injections                       External resources can alter the page

Evaluation

As discussed earlier in the article, you launch the crawler at the command line from within the mirror project directory:

scrapy crawl attr

Listing 9 shows the SQL query that generates the report shown in Figure 4 according to Tables 1 and 2. The strength of SQL lies in its compact, almost sentence-like style of expression. Converting the types and formatting the output, however, requires some tedious typing.

Listing 9

Report SQL Query

 
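As a rough illustration only, such a report over the factors from Table 2 could be generated along the following lines; the database filename, the rounding, and the column aliases are assumptions and not the article's actual Listing 9:

import sqlite3

# Open the database file written by the Attributes pipeline (name assumed).
conn = sqlite3.connect('results-20160101-000000.db')

# The derived factors follow the formulas from Table 2.
query = """
    SELECT url,
           relevancy,
           ROUND(CAST(words + semantics AS REAL) / (words + tags), 3) AS entropy,
           ROUND(CAST(semantics AS REAL) / tags, 3)                   AS expressivity,
           medias                                                     AS richness,
           ROUND(CAST(links AS REAL) / words, 3)                      AS reliability,
           injections                                                 AS mutability
    FROM Attributes
"""
for row in conn.execute(query):
    print(row)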

Figure 4: The page factors, based on the relative frequencies of the corresponding endogenous variables.

The endogenous page factors evaluate a web page from various perspectives. The derived entropy value is a measure of the average information content of the page. The terms are borrowed from information theory [10], but the values here are not identical to their counterparts there.

In the sample application, they describe only how non-semantic tags like div and span dilute the content. If the entropy value were 1, the generic spider would achieve better results. The average of 0.837 in Figure 4 indicates some scope for improvement.

Conclusion

Programming with Scrapy is fun and offers surprising insights. Thanks to the cleverly chosen modularization and good documentation, users can focus on extracting and accumulating data. If you delve deeper into Scrapy, you will also see the multitude of aspects it covers and the professional approach the framework pursues.
