Build your own crawlers

Spider, Spider

Article from Issue 191/2016

Author(s): Andreas Moller

Scrapy is an open source framework written in Python that lets you build your own crawlers with minimal effort for professional results.

A crawler demonstrates the capabilities of version 1.0 of the Scrapy framework [1] running under Python 2.7 [2]. Scrapy is an open source framework for extracting data from websites. It recursively crawls through HTML documents and follows all the links it finds.

In the spirit of HTML5, the test created in this article is designed to reveal non-semantic markup on websites. The crawler counts the number of words used per page, as well as the number of characteristic tag groups (Table 1), saving the results along with the URL in a database.

To install the required packages, I used the Debian 8 Apt package manager:

[...]

Use Express-Checkout link below to read the full article (PDF).

Buy this article as PDF

Express-Checkout as PDF

Price $2.95
(incl. VAT)

Buy Linux Magazine

SINGLE ISSUES

Print Issues

Digital Issues

SUBSCRIPTIONS

Print Subs

Digisubs

TABLET & SMARTPHONE APPS

US / Canada

UK / Australia

Support Our Work

Linux Magazine content is made possible with support from readers like you. Please consider contributing when you’ve found an article to be beneficial.

News

USB4 Maintainer Leaves Intel

Community , Kernel , Linux

Michael Jamet, one of the primary maintainers of USB4 and Thunderbolt drivers, has left Intel, leaving a gaping hole for the Linux community to deal with.
Budgie 10.9.3 Now Available

Budgie , Desktop , Linux

The latest version of this elegant and configurable Linux desktop aligns with changes in Gnome 49.
KDE Linux Alpha Available for Daring Users

KDE Linux , open source , Operating Systems

It's official, KDE Linux has arrived, but it's not quite ready for prime time.
AMD Initiates Graphics Driver Updates for Linux Kernel 6.18

graphics , Linux

This new AMD update focuses on power management, display handling, and hardware support for Radeon GPUs.
AerynOS Alpha Release Available

AerynOS , Linux

With a choice of several desktop environments, AerynOS 2025.08 is almost ready to be your next operating system.
AUR Repository Still Under DDoS Attack

Arch Linux , Security

Arch User Repository continues to be under a DDoS attack that has been going on for more than two weeks.
RingReaper Malware Poses Danger to Linux Systems

Linux , malware , Security

A new kind of malware exploits modern Linux kernels for I/O operations.
Happy Birthday, Linux

Linux , open source , Operating Systems

On August 25, Linux officially turns 34.
VirtualBox 7.2 Has Arrived

Kernel , Linux , Virtualization

With early support for Linux kernel 6.17 and other new additions, VirtualBox 7.2 is a must-update for users.
Linux Mint 22.2 Beta Available for Testing

Linux mint , Operating Systems , Wayland

Some interesting new additions and improvements are coming to Linux Mint. Check out the Linux Mint 22.2 Beta to give it a test run.

Build your own crawlers

Spider, Spider

Buy this article as PDF

Buy Linux Magazine

Related content

Subscribe to our Linux Newsletters
Find Linux and Open Source Jobs
Subscribe to our ADMIN Newsletters

Support Our Work

News

USB4 Maintainer Leaves Intel

Budgie 10.9.3 Now Available

KDE Linux Alpha Available for Daring Users

AMD Initiates Graphics Driver Updates for Linux Kernel 6.18

AerynOS Alpha Release Available

AUR Repository Still Under DDoS Attack

RingReaper Malware Poses Danger to Linux Systems

Happy Birthday, Linux

VirtualBox 7.2 Has Arrived

Linux Mint 22.2 Beta Available for Testing

Build your own crawlers

Spider, Spider

Buy this article as PDF

Buy Linux Magazine

Related content

Subscribe to our Linux Newsletters Find Linux and Open Source Jobs Subscribe to our ADMIN Newsletters

Support Our Work

News

Tag Cloud

Subscribe to our Linux Newsletters
Find Linux and Open Source Jobs
Subscribe to our ADMIN Newsletters