Scraping highly dynamic websites
Programming Snapshot – chromedp

© Lead Image © Rene Walter, 123RF
Screen scrapers often fail when confronted with complex web pages. To keep his scraper on task, Mike Schilli remotely controls the Chrome browser using the DevTools protocol to extract data, even from highly dynamic web pages.
Gone are the days when hobbyists could simply download websites quickly with a curl
command in order to machine-process their content. The problem is that state-of-the-art websites are teeming with reactive design and dynamic content that only appears when a bona fide, JavaScript-enabled web browser points to it.
For example, if you wanted to write a screen scraper for Gmail, you wouldn't even get through the login process with your script. In fact, even a scraping framework like Colly [1] would fail here, because it does not support JavaScript and does not know the browser's DOM (Document Object Model), upon which the web flow relies. One elegant workaround is for the scraper program to navigate a real browser to the desired web page and to inquire later about the content currently displayed.
For years, developers have been using the Java Selenium suite for fully automated unit tests for Web user interfaces (UIs). The tool speaks the Selenium protocol, which is supported by all standard browsers, to get things moving. Google's Chrome browser additionally implements the DevTools protocol [2], which does similar things, and the chromedp project on GitHub [3] defines a Go library based on it. Go enthusiasts can now write their unit tests and scraper programs natively in their favorite language. I'll take a look at some screen-scraping techniques in this article, but keep in mind that many websites have licenses that prohibit screen scraping. See the site's permission page and consult the applicable laws for your jurisdiction.
[...]
Buy this article as PDF
(incl. VAT)
Buy Linux Magazine
Direct Download
Read full article as PDF:
Price $2.95
News
-
Mageia 8 is Now Available with Linux 5.10 LTS
The latest release of Mageia includes improved graphics support for both AMD and NVIDIA GPUs.
-
GNOME 40 Beta has been Released
Anyone looking to test the beta for the upcoming GNOME 40 release can now do so.
-
OpenMandriva Lx 4.2 has Arrived
The latest stable version of OpenMandriva has been released and offers the newest KDE desktop and ARM support.
-
Thunderbird 78 is being ported to Ubuntu 20.04
The Ubuntu developers have made the decision to port the latest release of Thunderbird to the LTS version of the platform.
-
Elementary OS is Bringing Multi-Touch Gestures to the OS
User-friendly Linux distribution, elementary OS, is working to make using the fan-favorite platform even better for laptops.
-
Decade-Old Sudo Flaw Discovered
A vulnerability has been discovered in the Linux sudo command that’s been hiding in plain sight.
-
Another New Linux Laptop has Arrived
Slimbook has released a monster of a Linux gaming laptop.
-
Mozilla VPN Now Available for Linux
The promised subscription-based VPN service from Mozilla is now available for the Linux platform.
-
Wayland and New App Menu Coming to KDE
The 2021 roadmap for the KDE desktop environment includes some exciting features and improvements.
-
Deepin 20.1 has Arrived
Debian-based Deepin 20.1 has been released with some interesting new features.