Scraping highly dynamic websites
Programming Snapshot – chromedp
© Lead Image © Rene Walter, 123RF
Screen scrapers often fail when confronted with complex web pages. To keep his scraper on task, Mike Schilli remotely controls the Chrome browser using the DevTools protocol to extract data, even from highly dynamic web pages.
Gone are the days when hobbyists could simply download websites quickly with a curl command in order to machine-process their content. The problem is that state-of-the-art websites are teeming with reactive design and dynamic content that only appears when a bona fide, JavaScript-enabled web browser points to it.
For example, if you wanted to write a screen scraper for Gmail, you wouldn't even get through the login process with your script. In fact, even a scraping framework like Colly [1] would fail here, because it does not support JavaScript and does not know the browser's DOM (Document Object Model), upon which the web flow relies. One elegant workaround is for the scraper program to navigate a real browser to the desired web page and to inquire later about the content currently displayed.
For years, developers have been using the Java Selenium suite for fully automated unit tests for Web user interfaces (UIs). The tool speaks the Selenium protocol, which is supported by all standard browsers, to get things moving. Google's Chrome browser additionally implements the DevTools protocol [2], which does similar things, and the chromedp project on GitHub [3] defines a Go library based on it. Go enthusiasts can now write their unit tests and scraper programs natively in their favorite language. I'll take a look at some screen-scraping techniques in this article, but keep in mind that many websites have licenses that prohibit screen scraping. See the site's permission page and consult the applicable laws for your jurisdiction.
[...]
Buy this article as PDF
(incl. VAT)
Buy Linux Magazine
Subscribe to our Linux Newsletters
Find Linux and Open Source Jobs
Subscribe to our ADMIN Newsletters
Support Our Work
Linux Magazine content is made possible with support from readers like you. Please consider contributing when you’ve found an article to be beneficial.
News
-
Fedora 43 Has Finally Landed
The Fedora Linux developers have announced their latest release, Fedora 43.
-
KDE Unleashes Plasma 6.5
The Plasma 6.5 desktop environment is now available with new features, improvements, and the usual bug fixes.
-
Xubuntu Site Possibly Hacked
It appears that the Xubuntu site was hacked and briefly served up a malicious ZIP file from its download page.
-
LMDE 7 Now Available
Linux Mint Debian Edition, version 7, has been officially released and is based on upstream Debian.
-
Linux Kernel 6.16 Reaches EOL
Linux kernel 6.16 has reached its end of life, which means you'll need to upgrade to the next stable release, Linux kernel 6.17.
-
Amazon Ditches Android for a Linux-Based OS
Amazon has migrated from Android to the Linux-based Vega OS for its Fire TV.
-
Cairo Dock 3.6 Now Available for More Compositors
If you're a fan of third-party desktop docks, then the latest release of Cairo Dock with Wayland support is for you.
-
System76 Unleashes Pop!_OS 24.04 Beta
System76's first beta of Pop!_OS 24.04 is an impressive feat.
-
Linux Kernel 6.17 is Available
Linus Torvalds has announced that the latest kernel has been released with plenty of core improvements and even more hardware support.
-
Kali Linux 2025.3 Released with New Hacking Tools
If you're a Kali Linux fan, you'll be glad to know that the third release of this famous pen-testing distribution is now available with updates for key components.

