Scraping highly dynamic websites
Programming Snapshot – chromedp

© Lead Image © Rene Walter, 123RF
Screen scrapers often fail when confronted with complex web pages. To keep his scraper on task, Mike Schilli remotely controls the Chrome browser using the DevTools protocol to extract data, even from highly dynamic web pages.
Gone are the days when hobbyists could simply download websites quickly with a curl
command in order to machine-process their content. The problem is that state-of-the-art websites are teeming with reactive design and dynamic content that only appears when a bona fide, JavaScript-enabled web browser points to it.
For example, if you wanted to write a screen scraper for Gmail, you wouldn't even get through the login process with your script. In fact, even a scraping framework like Colly [1] would fail here, because it does not support JavaScript and does not know the browser's DOM (Document Object Model), upon which the web flow relies. One elegant workaround is for the scraper program to navigate a real browser to the desired web page and to inquire later about the content currently displayed.
For years, developers have been using the Java Selenium suite for fully automated unit tests for Web user interfaces (UIs). The tool speaks the Selenium protocol, which is supported by all standard browsers, to get things moving. Google's Chrome browser additionally implements the DevTools protocol [2], which does similar things, and the chromedp project on GitHub [3] defines a Go library based on it. Go enthusiasts can now write their unit tests and scraper programs natively in their favorite language. I'll take a look at some screen-scraping techniques in this article, but keep in mind that many websites have licenses that prohibit screen scraping. See the site's permission page and consult the applicable laws for your jurisdiction.
[...]
Buy this article as PDF
(incl. VAT)
Buy Linux Magazine
Subscribe to our Linux Newsletters
Find Linux and Open Source Jobs
Subscribe to our ADMIN Newsletters
Support Our Work
Linux Magazine content is made possible with support from readers like you. Please consider contributing when you’ve found an article to be beneficial.

News
-
Linux Kernel 6.17 is Available
Linus Torvalds has announced that the latest kernel has been released with plenty of core improvements and even more hardware support.
-
Kali Linux 2025.3 Released with New Hacking Tools
If you're a Kali Linux fan, you'll be glad to know that the third release of this famous pen-testing distribution is now available with updates for key components.
-
Zorin OS 18 Beta Available for Testing
The latest release from the team behind Zorin OS is ready for public testing, and it includes plenty of improvements to make it more powerful, user-friendly, and productive.
-
Fedora Linux 43 Beta Now Available for Testing
Fedora Linux 43 Beta ships with Gnome 49 and KDE Plasma 6.4 (and other goodies).
-
USB4 Maintainer Leaves Intel
Michael Jamet, one of the primary maintainers of USB4 and Thunderbolt drivers, has left Intel, leaving a gaping hole for the Linux community to deal with.
-
Budgie 10.9.3 Now Available
The latest version of this elegant and configurable Linux desktop aligns with changes in Gnome 49.
-
KDE Linux Alpha Available for Daring Users
It's official, KDE Linux has arrived, but it's not quite ready for prime time.
-
AMD Initiates Graphics Driver Updates for Linux Kernel 6.18
This new AMD update focuses on power management, display handling, and hardware support for Radeon GPUs.
-
AerynOS Alpha Release Available
With a choice of several desktop environments, AerynOS 2025.08 is almost ready to be your next operating system.
-
AUR Repository Still Under DDoS Attack
Arch User Repository continues to be under a DDoS attack that has been going on for more than two weeks.