Build your own crawlers

Spider, Spider

Article from Issue 191/2016

Author(s): Andreas Moller

Scrapy is an open source framework written in Python that lets you build your own crawlers with minimal effort for professional results.

This article builds a sample crawler to demonstrate the capabilities of version 1.0 of the Scrapy framework [1], running under Python 2.7 [2]. Scrapy is an open source framework for extracting data from websites. It recursively crawls through HTML documents and follows all the links it finds.
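The idea of recursive crawling can be illustrated without Scrapy itself. The following is a minimal sketch that uses only the Python 3 standard library, not Scrapy's API and not the article's code. The names `LinkCollector`, `crawl`, and `PAGES`, as well as the `example.test` URLs, are assumptions made up for this illustration, and the "site" is a dictionary held in memory rather than a real web server.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(url, fetch, seen=None):
    """Visit url and, recursively, every page reachable from it.

    fetch is a callable that returns the HTML for a URL, or None
    if the page is unavailable. Returns the URLs in visiting order.
    """
    if seen is None:
        seen = []
    if url in seen:
        return seen  # already visited: avoids endless loops
    html = fetch(url)
    if html is None:
        return seen
    seen.append(url)
    parser = LinkCollector()
    parser.feed(html)
    for href in parser.links:
        # Resolve relative links against the current page
        crawl(urljoin(url, href), fetch, seen)
    return seen


# Hypothetical three-page site held in memory instead of on the network
PAGES = {
    "http://example.test/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.test/a.html": '<a href="/b.html">B</a> <a href="/">Home</a>',
    "http://example.test/b.html": "<p>No links here.</p>",
}

visited = crawl("http://example.test/", PAGES.get)
print(visited)
```

The list of visited URLs is what keeps the recursion from looping when pages link back to each other. A framework such as Scrapy additionally handles scheduling, duplicate filtering, and politeness settings, which this sketch leaves out.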

In the spirit of HTML5, the test created in this article is designed to reveal non-semantic markup on websites. The crawler counts the number of words used per page, as well as the number of characteristic tag groups (Table 1), saving the results along with the URL in a database.
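As a rough illustration of this kind of test, the sketch below counts words and tag groups for a single page and stores the result with the URL in an SQLite database. It is an assumption-laden example written for Python 3 with only the standard library, not the article's implementation: the contents of `TAG_GROUPS` are hypothetical stand-ins (Table 1 is not reproduced here), and the `pages` table layout, the `PageStats` class, and the `analyze` function are invented for the example.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical tag groups; the article's Table 1 defines the real ones
TAG_GROUPS = {
    "semantic": {"article", "section", "nav", "header", "footer", "aside"},
    "generic": {"div", "span"},
}


class PageStats(HTMLParser):
    """Count words in text nodes and opening tags per tag group."""

    def __init__(self):
        super().__init__()
        self.words = 0
        self.groups = {name: 0 for name in TAG_GROUPS}
        self._skip = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        for name, tags in TAG_GROUPS.items():
            if tag in tags:
                self.groups[name] += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Ignore script and style content; count whitespace-separated words
        if not self._skip:
            self.words += len(data.split())


def analyze(url, html, db):
    """Parse one page and save its counts along with the URL."""
    stats = PageStats()
    stats.feed(html)
    stats.close()
    db.execute(
        "INSERT INTO pages (url, words, semantic, generic) VALUES (?, ?, ?, ?)",
        (url, stats.words, stats.groups["semantic"], stats.groups["generic"]),
    )
    return stats


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE pages (url TEXT PRIMARY KEY, words INTEGER, "
    "semantic INTEGER, generic INTEGER)"
)
html = "<div><article><p>Hello semantic world</p></article><span>two words</span></div>"
analyze("http://example.test/", html, db)
row = db.execute("SELECT url, words, semantic, generic FROM pages").fetchone()
print(row)
```

A page with many `generic` tags and few `semantic` ones relative to its word count would be a candidate for non-semantic markup under this simplified scheme.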

To install the required packages, I used the Debian 8 Apt package manager:

[...]
