Scraping highly dynamic websites
Full-Text Search
How can the newly launched browser drone simulate user interactions such as mouse clicks or keyboard strokes to follow more complicated web flows? The chromedp library offers functions that select certain form fields or buttons from the displayed web page's DOM via an XPath query and communicate with them via SendKeys(), Submit(), or Click().
For example, to fire off a full-text search of all repositories on GitHub, you first need to discover the name of the search field there. A quick look in the Chrome browser's developer view (via the Inspect Elements menu) reveals that the search field's name attribute is q (Figure 2). Line 15 from Listing 3 defines the associated XPath query //input[@name="q"] in the sel variable. After accessing the GitHub page, the WaitVisible() task in line 19 waits until the search field arrives via the Internet's series of tubes.
Listing 3
github.go
01 package main
02
03 import (
04   "context"
05   "fmt"
06   "github.com/chromedp/cdproto/dom"
07   cdp "github.com/chromedp/chromedp"
08 )
09
10 func main() {
11   ctx, cancel :=
12     cdp.NewContext(context.Background())
13   defer cancel()
14
15   sel := `//input[@name="q"]`
16   tasks := cdp.Tasks{
17     cdp.Navigate(
18       "https://github.com/search"),
19     cdp.WaitVisible(sel),
20     cdp.SendKeys(sel, "waaah\n"),
21     cdp.WaitReady("body", cdp.ByQuery),
22     cdp.ActionFunc(
23       func(ctx context.Context) error {
24         node, err := dom.GetDocument().Do(ctx)
25         if err != nil {
26           panic(err)
27         }
28         res, err := dom.GetOuterHTML().
29           WithNodeID(node.NodeID).Do(ctx)
30         if err != nil {
31           panic(err)
32         }
33         fmt.Printf("html=%s\n", res)
34         return nil
35       }),
36   }
37   err := cdp.Run(ctx, tasks)
38   if err != nil {
39     panic(err)
40   }
41 }
The SendKeys() function in line 20 then sends the search string (waaah) to the input field and terminates it with a newline character. This is enough to prompt the GitHub UI to initiate the search without having to press a Submit button. Line 21 then waits with WaitReady("body", cdp.ByQuery) until the result is available, before initiating the output of the returned raw HTML with a user-defined ActionFunc().
With the content displayed in Figure 3, it then uses dom.GetDocument() to retrieve the root node of the HTML document and dom.GetOuterHTML() to get the raw HTML code on the page. For test purposes, the Printf() function outputs the displayed page's source code; a scraper application could extract interesting content from this and process it further.
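As a rough illustration of that further processing, the sketch below pulls the link targets out of an HTML string using only Go's standard library. The extractLinks() helper and the sample markup are assumptions invented for this example, not part of Listing 3. Running encoding/xml in non-strict mode is a workaround that copes with simple markup; for real-world pages, a dedicated HTML parser is likely the more robust choice.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// extractLinks returns the href values of all <a> elements it can
// read from page. It stops quietly at the end of the input or at the
// first markup error and returns whatever it has collected so far.
func extractLinks(page string) []string {
	dec := xml.NewDecoder(strings.NewReader(page))
	// Relax the XML rules so that typical HTML gets through.
	dec.Strict = false
	dec.AutoClose = xml.HTMLAutoClose
	dec.Entity = xml.HTMLEntity

	var links []string
	for {
		tok, err := dec.Token()
		if err != nil {
			return links
		}
		el, ok := tok.(xml.StartElement)
		if !ok || el.Name.Local != "a" {
			continue
		}
		for _, attr := range el.Attr {
			if attr.Name.Local == "href" {
				links = append(links, attr.Value)
			}
		}
	}
}

func main() {
	// Made-up stand-in for the HTML that the scraper printed.
	sample := `<html><body><p>Results<br>` +
		`<a href="/example/waaah">waaah</a> ` +
		`<a class="x" href="/example/other">other</a>` +
		`</p></body></html>`

	for _, link := range extractLinks(sample) {
		fmt.Println(link)
	}
}
```

In a scraper, the string passed to extractLinks() would be the res value obtained from dom.GetOuterHTML() instead of the hard-coded sample.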
Not Entirely Welcome
Chromedp supports all Chrome and Edge browsers, but not Firefox, which is based on a different technology. The chromedp project on GitHub is trying to keep up with new Chrome versions and changes to the DevTools protocol: Sometimes, things don't work as expected, and workarounds have to be found.
There are open issues on the GitHub project, and the developers are addressing problems with new versions of the library. Besides Google's official documentation [2], there are some tutorials online, but no thorough and practical publication on the topic. I did find one book [5] that contains a huge amount of useless filler material, but it also covers various topics in the field of web scraping and briefly discusses Selenium and Chrome DevTools.
Popular websites like Facebook, Twitter, or Google also seem to be interested in making scraping with tools like chromedp as hard as possible. For example, Gmail uses dynamically generated random names for the input fields on the login page. In addition, ongoing changes to the page layout often require a scraper adjustment, so that the whole undertaking remains an arms race between the content provider and the scraper developers. The industry giants want to bind their users to the official pages in order to bombard them with advertising, because, let's face it, somebody has to pay the server bill at the end of the month.
Infos
- [1] "Programming Snapshot – Colly" by Mike Schilli, Linux Magazine, issue 223, June 2019, pp. 54-57
- [2] Chrome DevTools: https://developers.google.com/web/tools/chrome-devtools/
- [3] chromedp: https://github.com/chromedp/chromedp
- [4] Listings for this article: ftp://ftp.linux-magazine.com/pub/listings/linux-magazine.com/234/
- [5] Smith, Vincent. Go Web Scraping Quick Start Guide, Packt Publishing, 2019: https://learning.oreilly.com/library/view/go-web-scraping/9781789615708/cover.xhtml