Build your own crawlers

Spider, Spider

Lead Image © mtkang, 123RF.com

Article from Issue 191/2016

Scrapy is an open source framework written in Python that lets you build your own crawlers with minimal effort for professional results.

This article's crawler demonstrates the capabilities of version 1.0 of the Scrapy framework [1] running under Python 2.7 [2]. Scrapy is an open source framework for extracting data from websites. It recursively crawls through HTML documents and follows all the links it finds.

In the spirit of HTML5, the test created in this article is designed to reveal non-semantic markup on websites. The crawler counts the number of words used per page, as well as the number of characteristic tag groups (Table 1), saving the results along with the URL in a database.

Table 1

Definition of Stored Metrics

Metric       Meaning
keywords     Number of words in the <title> tag
words        Number of all words except keywords
relevancy    Frequency of keywords in the total number of words
tags         Total number of all tags
semantics    Total number of all semantic tags
links        Total number of all <a> tags with an href attribute
injections   Number of third-party resources

To install the required packages, I used the Debian 8 Apt package manager:

apt-get install python-pip libxslt1-dev python-dev python-lxml

The packages include the Python package manager (Pip), the libxslt library along with the header files, the Python header files, and the Python bindings for libxml and libxslt. Because Debian 8 comes with Python 2.7 and libxml pre-installed, you can install Scrapy as follows:

pip install scrapy

Unlike Apt, Pip installs the latest Scrapy version for Python 2.7 from the Python Package Index [3].

Test Run

To begin, open an interactive session in the Scrapy shell by entering scrapy shell (Figure 1). Next, send a command to the Scrapy engine to tell the on-board downloader to read the German Linux-Magazin homepage through an HTTP request and transfer the results to the response object (Figure 2):

Figure 1: In the Scrapy shell, you can test commands interactively.
Figure 2: The on-board downloader bundles the Linux-Magazin site into a response object.
fetch('http://www.linux-magazin.de')

Figure 3 demonstrates in detail how the components of the Scrapy architecture work together. This illustration makes it clear that the engine does not talk directly to the downloaders but first passes the HTTP request to the scheduler (Figure 3, top). The downloader middleware (Figure 3, center right) modifies the HTTP request before deployment. CookiesMiddleware, which is enabled by default, stores the cookies from the queried domain, whereas RobotsTxtMiddleware suppresses the retrieval of documents blocked for crawlers by the robots.txt [5] file on the web server.

Figure 3: The Scrapy engine delegates tasks to different components, like the spider, the item pipelines, and middleware [4]. Twisted, an event-driven network framework, works in the background.
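Which downloader middleware components run is itself controlled in settings.py. A minimal sketch, assuming the Scrapy 1.0 module paths and default priority values (robots.txt handling additionally requires the ROBOTSTXT_OBEY switch), might look like this:

```python
# settings.py fragment (illustrative); paths and priorities
# reflect the Scrapy 1.0 defaults
ROBOTSTXT_OBEY = True  # let RobotsTxtMiddleware honor robots.txt

DOWNLOADER_MIDDLEWARES = {
    # Lower numbers run closer to the engine
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
}
```

Setting a component's value to None instead of a number disables it.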

Tracker

Scrapy evaluates the document components by interactively querying and investigating them via the response object, as shown in Figure 2. The selection is made either with the help of CSS selectors [6], as in jQuery, or with XPath expressions [7], as in XSLT. For example, first enter the command,

response.xpath('//title/text()').extract()

as shown in Figure 2 to call the xpath() method. The //title subexpression selects all <title> tags from the HTML document, and /text() selects the text nodes they contain. The extract() method transfers the result set to a Python list:

[u'Home \xbb Linux-Magazin']

Using the expression

len(response.xpath('//a/@href').extract())

you can extract the values of the href attributes of all <a> tags into a list. Its length is revealed by the Python len() function; in this case, there are 215 (see Figure 2).
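The same two selections can be approximated outside Scrapy with Python's standard library. This minimal sketch (the HTML snippet is invented for illustration and must be well-formed XML for ElementTree) mimics //title/text() and //a/@href:

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed stand-in for a real page (illustrative only)
html = ('<html><head><title>Home - Linux-Magazin</title></head>'
        '<body><a href="/news">News</a><a href="/archive">Archive</a>'
        '<a>no href</a></body></html>')

root = ET.fromstring(html)
title = root.find('.//title').text    # like //title/text()
links = root.findall('.//a[@href]')   # like //a/@href: only <a> with href
print(title)       # Home - Linux-Magazin
print(len(links))  # 2 -- the third <a> has no href attribute
```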

Getting Started

A sample application can be compiled with a little knowledge of Scrapy. The command

scrapy startproject mirror

lays the foundation by creating a matching directory structure (Listing 1). The user runs the application from within the mirror project directory. The empty files named __init__.py are purely technical in nature [8]. The spider and pipeline classes can be found in subdirectories of the same name. Scrapy stores the results in the results directory and the associated reports in the reports directory.

Listing 1

mirror Sample Project

 |- scrapy.cfg
 |- mirror:
   |- __init__.py
   |- items.py
   |- pipelines
     |- __init__.py
     |- filter.py
     |- normalize.py
     |- store.py
   |- reports
     |- attr.py
   |- results
     |- 2016032210001458637243.sqlite3
   |- settings.py
   |- spiders
     |- __init__.py
     |- attr.py
   |- utils.py

A few listings will be added to the skeleton project later. The mirror/utils.py file from the last line of Listing 1 stores the helper functions. Listing 2 shows the contents of this file.

Listing 2

mirror/utils.py

from urlparse import urlparse

# Return alist[key] if present, otherwise a default
def optvalue(alist, key, default=[]):
  if key in alist:
    return alist[key]
  return default

# Extract the domain part of a URL
def domain(url):
  return urlparse(url).netloc

# Build an XPath union expression from a list of tag names
def join(tags):
  return "|".join(['//'+tag for tag in tags])

# Frequency of the keywords kds among the words wds
def relevance(kds, wds):
  if len(kds) == 0 or len(wds) == 0:
    return 0
  return reduce(lambda acc, kw: float(wds.count(kw)) + acc, kds, 0)/len(kds + wds)

# True if the URL is absolute or protocol-relative
def is_absurl(url):
  return reduce(lambda acc, p: url.startswith(p) or acc, [u'http://', u'https://', u'//'], False)

The global settings for the project are also in Python format and belong in mirror/settings.py (Listing 3). Scrapy itself creates the variables in the first three lines; their capitalization is reminiscent of constants in C, although Python has no true constants.

Listing 3

mirror/settings.py

01 BOT_NAME = 'mirror'
02 SPIDER_MODULES = ['mirror.spiders']
03 NEWSPIDER_MODULE = 'mirror.spiders'
04 ITEM_PIPELINES = {
05     'mirror.pipelines.normalize.Words': 300,
06     'mirror.pipelines.filter.Injections': 400,
07     'mirror.pipelines.store.Attributes': 500
08 }
09 RESULTS = 'mirror/results/'
10 MEDIA_TAGS = ['video', 'audio', 'img', 'canvas']
11 INJECT_TAGS = ['script/@src', 'img/@src', 'video/@src', 'audio/@src', 'iframe/@src', 'embed/@src', 'link/@href']
12 SEMANTIC_TAGS = ['html', 'head', 'title', 'meta', 'link', 'body', 'header', 'footer', 'nav', 'article', 'aside', 'section', 'h1', 'h2', 'h3', 'h4', 'p', 'a', 'ul', 'ol', 'li', 'dl', 'dt', 'figure', 'table', 'th', 'tr', 'td', 'video', 'audio', 'form', 'input',  'label', 'button']

Line 1 stores the name that Scrapy sends to the queried web server in the HTTP request header instead of a browser identifier. Lines 2 and 3 are inherent to the Scrapy system and require no change. The variables that follow store application-specific constants.
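Tag lists such as SEMANTIC_TAGS feed the join() helper from Listing 2, which builds a single XPath union expression out of them. A hypothetical fragment (the tag list is shortened here, and the spider code itself is not part of this excerpt) illustrates the idea:

```python
# Hypothetical illustration: turning a tag list into an XPath union,
# as a spider callback might do with SEMANTIC_TAGS
SEMANTIC_TAGS = ['header', 'nav', 'article', 'section']  # shortened

def join(tags):  # same logic as mirror/utils.py
    return "|".join(['//' + tag for tag in tags])

xpath_expr = join(SEMANTIC_TAGS)
print(xpath_expr)  # //header|//nav|//article|//section
# In a spider callback, the metric would then be something like:
#   semantics = len(response.xpath(xpath_expr).extract())
```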
