Scraping the web for data

Data Harvesting

Article from Issue 233/2020

Author(s): Marco Fioretti

Web scraping lets you automatically download and extract data from websites to build your own database. With a simple scraping script, you can harvest information from the web.

If you are looking to collect data from the Internet for a personal database, your first stop is often a Google search. However, a search for mortgage rates can (in theory) return dozens of pages full of relevant images and data, as well as a lot of irrelevant content. You could visit every web page pulled up by your search and cut and paste the relevant data into your database. Or you could use a web scraper to automatically download and extract raw data from the web pages and reformat it into a table, graph, or spreadsheet on your computer.

Web scraping is not just for big data professionals. Small business owners, teachers, students, shoppers, and the merely curious can use it for all manner of tasks, from researching a PhD thesis to building a database of local doctors to comparing prices for online shopping. Unless you need very complex processing or heavily optimized performance, web scraping is relatively easy. In this article, I'll show you how web scraping works with some practical examples that use the open source tool Beautiful Soup.
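To give a first idea of what extracting data and reformatting it into a table can look like, here is a minimal sketch that uses Beautiful Soup with Python's built-in HTML parser. It is not one of the article's own examples: the HTML fragment, the lender names, the rates, the `rates` table id, and the helper names `extract_rows` and `rows_to_csv` are all made up for illustration, and the page is embedded as a string so that the sketch runs without a network connection.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical page fragment standing in for a downloaded web page.
# All names and numbers here are invented for illustration.
SAMPLE_HTML = """
<html><body>
  <h1>Mortgage rates</h1>
  <table id="rates">
    <tr><th>Lender</th><th>Rate</th></tr>
    <tr><td>Example Bank</td><td>3.1%</td></tr>
    <tr><td>Sample Credit Union</td><td>2.9%</td></tr>
  </table>
  <p>Unrelated advertising text.</p>
</body></html>
"""


def extract_rows(html):
    """Return the cell text of every row in the table with id 'rates'."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id="rates")
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        # Header cells (th) and data cells (td) are treated alike.
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows


def rows_to_csv(rows):
    """Format the extracted rows as CSV text, ready for a spreadsheet."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(extract_rows(SAMPLE_HTML)), end="")
```

In a real script, the HTML would come from a page you download first rather than from a string. The extraction step stays the same: locate the element that holds the data, pull out its text, and ignore everything else on the page.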

Caveats

Web scraping does have its limits. First, you have to start off with a well-crafted search engine query; web scraping can't replace the initial search. To protect their business and observe legal constraints, search engines deploy anti-scraping features; overcoming them is not worth the time of the occasional web scraper. Instead, web scraping shines (and is irreplaceable) after you have completed your web search.

[...]


