Scraping highly dynamic websites
Programming Snapshot – chromedp

© Lead Image © Rene Walter, 123RF
Screen scrapers often fail when confronted with complex web pages. To keep his scraper on task, Mike Schilli remotely controls the Chrome browser using the DevTools protocol to extract data, even from highly dynamic web pages.
Gone are the days when hobbyists could simply download websites quickly with a curl
command in order to machine-process their content. The problem is that state-of-the-art websites are teeming with reactive design and dynamic content that only appears when a bona fide, JavaScript-enabled web browser points to it.
For example, if you wanted to write a screen scraper for Gmail, you wouldn't even get through the login process with your script. In fact, even a scraping framework like Colly [1] would fail here, because it does not support JavaScript and does not know the browser's DOM (Document Object Model), upon which the web flow relies. One elegant workaround is for the scraper program to navigate a real browser to the desired web page and to inquire later about the content currently displayed.
For years, developers have been using the Java Selenium suite for fully automated unit tests for Web user interfaces (UIs). The tool speaks the Selenium protocol, which is supported by all standard browsers, to get things moving. Google's Chrome browser additionally implements the DevTools protocol [2], which does similar things, and the chromedp project on GitHub [3] defines a Go library based on it. Go enthusiasts can now write their unit tests and scraper programs natively in their favorite language. I'll take a look at some screen-scraping techniques in this article, but keep in mind that many websites have licenses that prohibit screen scraping. See the site's permission page and consult the applicable laws for your jurisdiction.
[...]
Buy this article as PDF
(incl. VAT)
Buy Linux Magazine
Subscribe to our Linux Newsletters
Find Linux and Open Source Jobs
Subscribe to our ADMIN Newsletters
Support Our Work
Linux Magazine content is made possible with support from readers like you. Please consider contributing when you’ve found an article to be beneficial.

News
-
System76 Releases COSMIC Alpha 7
With scores of bug fixes and a really cool workspaces feature, COSMIC is looking to soon migrate from alpha to beta.
-
OpenMandriva Lx 6.0 Available for Installation
The latest release of OpenMandriva has arrived with a new kernel, an updated Plasma desktop, and a server edition.
-
TrueNAS 25.04 Arrives with Thousands of Changes
One of the most popular Linux-based NAS solutions has rolled out the latest edition, based on Ubuntu 25.04.
-
Fedora 42 Available with Two New Spins
The latest release from the Fedora Project includes the usual updates, a new kernel, an official KDE Plasma spin, and a new System76 spin.
-
So Long, ArcoLinux
The ArcoLinux distribution is the latest Linux distribution to shut down.
-
What Open Source Pros Look for in a Job Role
Learn what professionals in technical and non-technical roles say is most important when seeking a new position.
-
Asahi Linux Runs into Issues with M4 Support
Due to Apple Silicon changes, the Asahi Linux project is at odds with adding support for the M4 chips.
-
Plasma 6.3.4 Now Available
Although not a major release, Plasma 6.3.4 does fix some bugs and offer a subtle change for the Plasma sidebar.
-
Linux Kernel 6.15 First Release Candidate Now Available
Linux Torvalds has announced that the release candidate for the final release of the Linux 6.15 series is now available.
-
Akamai Will Host kernel.org
The organization dedicated to cloud-based solutions has agreed to host kernel.org to deliver long-term stability for the development team.