Build your own crawlers
Pneumatic Tube
The item pipeline continues to process the Item objects; the ITEM_PIPELINES variable configures the pipeline in lines 4-8 of the sample application in Listing 3. It passes each Item object, one by one, to an object of the pipeline classes Words (300), Injections (400), and Attributes (500), each of which modifies or stores the item and pushes it onward.
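Based on the priorities quoted above and the module names of the listing files, a minimal sketch of that setting might read as follows (the exact module paths are an assumption):

```python
# mirror/settings.py -- sketch of the ITEM_PIPELINES setting described
# above; the module paths are assumptions based on the listing names.
ITEM_PIPELINES = {
    'mirror.pipelines.normalize.Words': 300,    # normalize the words
    'mirror.pipelines.filter.Injections': 400,  # filter foreign resources
    'mirror.pipelines.store.Attributes': 500,   # store results in SQLite
}
```

Scrapy runs the pipeline objects in ascending order of these numbers.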
Listing 6 shows the code for the Words class, which checks and standardizes the words for the subsequent evaluation. Scrapy creates the pipeline objects and calls each one's process_item() method for every item object (line 2). The method expects the item object as its second argument and the spider object, a reference to the calling spider, as its third.
Listing 6: mirror/pipelines/normalize.py
The next two lines overwrite the values of the keywords and words attributes. In line 4, the filter() function fishes out of the item[key] list all entries for which the lambda function (lambda wd: wd.isalnum()) returns a true value; this keeps only strings that consist purely of alphanumeric characters. A second lambda function, applied by map(), then converts the words in the resulting set to lowercase. The return statement in the last line hands the item object over to the next pipeline object (Listing 7), which considers the possible effect of foreign content on the current browser session.
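Pieced together from this description, the Words class might look roughly as follows; this is a hypothetical sketch, and the line numbers quoted in the text refer to the original listing:

```python
# mirror/pipelines/normalize.py -- hypothetical reconstruction of the
# Words pipeline class described above; the original may differ.
class Words(object):

    def process_item(self, item, spider):
        # Overwrite the keywords and words attributes with cleaned lists.
        for key in ('keywords', 'words'):
            # Keep only purely alphanumeric strings ...
            cleaned = filter(lambda wd: wd.isalnum(), item[key])
            # ... and convert the remaining words to lowercase.
            item[key] = list(map(lambda wd: wd.lower(), cleaned))
        # Hand the item over to the next pipeline object.
        return item
```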
Listing 7: mirror/pipelines/filter.py
The Injections pipeline class from Listing 7 relies on the helper functions is_absurl() and domain(), which come from mirror/utils.py, to reduce the list of resources to those bound to foreign content.
To do this, the process_item() method overwrites the attribute with the filtered list that the list expression in line 5 creates. The first for loop only reads the URL of the page; the second for loop iterates over the tags to be injected. If the if statement determines that an attribute is an absolute URL on a domain other than the current one, the loop picks up the resource into the other variable. The return statement hands the item, altered in this way, over to the last link in the pipeline (Listing 8), which reduces the results for later evaluation and stores them in a database.
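A corresponding sketch of the Injections class could then look like this, assuming that item['url'] stores the page URL in a single-element list and that is_absurl() and domain() behave as described:

```python
# mirror/pipelines/filter.py -- hypothetical reconstruction of the
# Injections pipeline class; field layout and helpers are assumptions.
from mirror.utils import is_absurl, domain

class Injections(object):

    def process_item(self, item, spider):
        # The first loop reads the page URL, the second iterates over
        # the tags to be injected; the if clause keeps only absolute
        # URLs that point to a domain other than the current one.
        other = [src for url in item['url']
                     for src in item['injections']
                     if is_absurl(src) and domain(src) != domain(url)]
        # Overwrite the attribute with the filtered list ...
        item['injections'] = other
        # ... and hand the altered item over to the last pipeline link.
        return item
```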
Listing 8: mirror/pipelines/store.py
The Attributes pipeline class (Listing 8, line 6) evaluates the item object and stores the results in a SQLite database file [9]. The free, SQL-compatible database framework does not require a server process and supports all common programming languages. Listing 8 writes all data directly and synchronously to the database file. Line 1 binds the matching Python driver, and the next three lines import the required functions from the standard modules os.path and time and from mirror/utils.py.
As its second parameter, the __init__() constructor accepts the path with which Python opens the SQLite database. Scrapy uses the from_crawler() class method (lines 10-12) to instantiate an object. A look at the method shows that from_crawler() receives the crawler object in its parameter list as a reference to the settings from Listing 3. It first reads the value of the RESULTS variable (Listing 3, line 9) and then passes that value to the constructor call in lines 7 and 8 (Listing 8). Finally, gmtime() and strftime(), combined with string formatting, generate a timestamp for the file name.
The Scrapy engine calls the open_spider() method (lines 14-17) once only, in the style of a callback function, when it creates the spider. The method creates a database connection and stores it in the conn attribute in line 15. Specifying isolation_level = None tells the driver to commit each SQL statement immediately, persisting it at once in the database file. Line 16 creates and stores the database cursor object that runs the database operations. These operations include the SQL command that creates the results table:
CREATE TABLE Attributes (url text PRIMARY KEY, keywords int, words int, relevancy int, tags int, semantics int, medias int, links int, injections int)
The process_item() method from line 22 maps the values of the item object onto the columns of the Attributes table, as described in Table 2, using the SQL INSERT command. The question marks are replaced by the values of the tuple that follows in parentheses. The len() function determines the lengths of several of the parsed lists. The helper function optvalue() swaps None values for empty lists; relevance() determines the incidence of all the keywords in the remaining text of the website.
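Assembled from the description, the Attributes class might look roughly like this; the item's field layout and the signatures of optvalue() and relevance() from mirror/utils.py are assumptions:

```python
# mirror/pipelines/store.py -- hypothetical reconstruction of the
# Attributes pipeline class; details may differ from the original.
import sqlite3                          # SQLite driver for Python
from os.path import join
from time import gmtime, strftime
from mirror.utils import optvalue, relevance  # assumed signatures

class Attributes(object):

    def __init__(self, path):
        # Path of the SQLite database file to write to.
        self.path = path

    @classmethod
    def from_crawler(cls, crawler):
        # Read the RESULTS directory from the settings (Listing 3) and
        # add a timestamp, generated by gmtime() and strftime(), to the
        # file name passed to the constructor.
        stamp = strftime('%Y%m%d%H%M%S', gmtime())
        return cls(join(crawler.settings.get('RESULTS'),
                        'attributes-%s.db' % stamp))

    def open_spider(self, spider):
        # Called once by the Scrapy engine; isolation_level=None makes
        # the driver persist every SQL statement immediately.
        self.conn = sqlite3.connect(self.path, isolation_level=None)
        self.cursor = self.conn.cursor()
        self.cursor.execute(
            'CREATE TABLE Attributes (url text PRIMARY KEY, '
            'keywords int, words int, relevancy int, tags int, '
            'semantics int, medias int, links int, injections int)')

    def process_item(self, item, spider):
        # Map the item's values onto the columns of the Attributes
        # table; optvalue() swaps None for an empty list, relevance()
        # counts the incidence of the keywords in the page text.
        self.cursor.execute(
            'INSERT INTO Attributes VALUES (?,?,?,?,?,?,?,?,?)',
            (item['url'],
             len(optvalue(item['keywords'])),
             len(optvalue(item['words'])),
             relevance(item['keywords'], item['words']),
             len(optvalue(item['tags'])),
             len(optvalue(item['semantics'])),
             len(optvalue(item['medias'])),
             len(optvalue(item['links'])),
             len(optvalue(item['injections']))))
        return item
```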
Table 2: Interpretation of Acquired Data

Metric | Computation | Interpretation
---|---|---
Relevancy | – | Measure of the credibility of the title
Entropy | (words+semantics)/(words+tags) | Non-semantic tags such as div or span reduce the information content
Expressivity | semantics/tags | Semantic tags improve the functional classification of document components
Richness | medias | Media enrich the content
Reliability | links/words | Links vouch for credibility
Mutability | injections | External resources alienate the page
Evaluation
As discussed earlier in the article, you launch the crawler at the command line from within the mirror project directory:
scrapy crawl attr
Listing 9 shows the SQL query that generates the report in Figure 4 according to Tables 1 and 2. The strength of SQL is revealed in its compact, almost sentence-like style of expression; however, converting the types and formatting the output requires some tedious typing.
Listing 9: Report SQL Query
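A query in the same spirit, deriving the metrics from the formulas in Table 2 and showing the cast() and round() calls that make the typing tedious, could look like this sketch (the database file name is an assumption):

```python
# report.py -- sketch of a report query in the spirit of Listing 9;
# the metric formulas follow Table 2, the file name is an assumption.
import sqlite3

conn = sqlite3.connect('attributes-20240101000000.db')
query = """
    SELECT url,
           relevancy,
           round(cast(words + semantics AS real) / (words + tags), 3)
               AS entropy,
           round(cast(semantics AS real) / tags, 3) AS expressivity,
           medias AS richness,
           round(cast(links AS real) / words, 3) AS reliability,
           injections AS mutability
    FROM Attributes
"""
for row in conn.execute(query):
    print(row)
```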
The endogenous page factors evaluate a web page from various perspectives. The derived entropy value is a measure of the average information content of the page. The terms are named after, but are not identical to, the corresponding concepts from information theory [10]. In the sample application, they merely describe how non-semantic tags like div and span dilute the content: a page with 500 words and 120 tags, 40 of them semantic, for example, scores (500+40)/(500+120) ≈ 0.87. If the entropy value were 1, the generic spider would achieve better results; the average of 0.837 in Figure 4 thus indicates some scope for improvement.
Conclusion
Programming with Scrapy is fun and offers surprising insights. Thanks to the cleverly chosen modularization and good documentation, users can focus on extracting and accumulating data. If you delve deeper into Scrapy, you will also see the multitude of aspects it covers and the professional approach the framework pursues.
Infos

[1] Scrapy framework: http://scrapy.org
[2] Python: https://python.org
[3] Python Package Index: http://pypi.python.org
[4] Scrapy docs: http://doc.scrapy.org/en/latest/topics/architecture.html
[5] Robots exclusion standard: https://en.wikipedia.org/wiki/Robots_Exclusion_Standard
[6] CSS selectors: https://api.jquery.com/category/selectors/
[7] XPath expressions: https://www.w3.org/TR/xpath/
[8] Meaning of __init__.py: http://stackoverflow.com/questions/448271/what-is-init-py-for
[9] SQLite: http://sqlite.org
[10] Information theory: https://en.wikipedia.org/wiki/Information_theory