Scraping highly dynamic websites

Programming Snapshot – chromedp

© Lead Image © Rene Walter, 123RF

Article from Issue 234/2020
Author(s): Mike Schilli

Screen scrapers often fail when confronted with complex web pages. To keep his scraper on task, Mike Schilli remotely controls the Chrome browser using the DevTools protocol to extract data, even from highly dynamic web pages.

Gone are the days when hobbyists could simply download websites quickly with a curl command in order to machine-process their content. The problem is that state-of-the-art websites are teeming with reactive design and dynamic content that only appears when a bona fide, JavaScript-enabled web browser points to it.

For example, if you wanted to write a screen scraper for Gmail, you wouldn't even get through the login process with your script. In fact, even a scraping framework like Colly [1] would fail here, because it does not support JavaScript and does not know the browser's DOM (Document Object Model), upon which the web flow relies. One elegant workaround is for the scraper program to navigate a real browser to the desired web page and to inquire later about the content currently displayed.

For years, developers have been using the Java-based Selenium suite for fully automated tests of web user interfaces (UIs). The tool speaks the Selenium protocol, which is supported by all standard browsers, to get things moving. Google's Chrome browser additionally implements the DevTools protocol [2], which does similar things, and the chromedp project on GitHub [3] defines a Go library based on it. Go enthusiasts can now write their unit tests and scraper programs natively in their favorite language. I'll take a look at some screen-scraping techniques in this article, but keep in mind that many websites have terms that prohibit screen scraping. Check the site's permission page and consult the applicable laws for your jurisdiction.

Directing Chrome

The Go program in Listing 1 [4] launches the Chrome browser, points it at the Linux Magazine web page, and then takes a screenshot of the retrieved content. The whole thing is built and run from the command line.

Listing 1

screenshot.go

01 package main
02
03 import (
04   "context"
05   emu "github.com/chromedp/cdproto/emulation"
06   "github.com/chromedp/cdproto/page"
07   cdp "github.com/chromedp/chromedp"
08   "io/ioutil"
09 )
10
11 func main() {
12   ctx, cancel :=
13     cdp.NewContext(context.Background())
14   defer cancel()
15
16   var buf []byte
17   tasks := cdp.Tasks{
18     cdp.Navigate(
19       "http://linux-magazine.com"),
20     cdp.ActionFunc(
21       func(ctx context.Context) error {
22         _, _, contentSize, err :=
23           page.GetLayoutMetrics().Do(ctx)
24         if err != nil {
25           panic(err)
26         }
27
28         w, h := contentSize.Width,
29           contentSize.Height
30
31         viewPortFix(ctx, int64(w), int64(h))
32
33         buf, err = page.CaptureScreenshot().
34           WithQuality(90).
35           WithClip(&page.Viewport{
36             X:      contentSize.X,
37             Y:      contentSize.Y,
38             Width:  w,
39             Height: h,
40             Scale:  1,
41           }).Do(ctx)
42         if err != nil {
43           panic(err)
44         }
45         return nil
46       })}
47
48   err := cdp.Run(ctx, tasks)
49   if err != nil {
50     panic(err)
51   }
52
53   err = ioutil.WriteFile("screenshot.png",
54     buf, 0644)
55   if err != nil {
56     panic(err)
57   }
58 }
59
60 func viewPortFix(
61   ctx context.Context, w, h int64) {
62   err := emu.SetDeviceMetricsOverride(
63     w, h, 1, false).
64     WithScreenOrientation(
65       &emu.ScreenOrientation{
66         Type:
67         emu.OrientationTypePortraitPrimary,
68         Angle: 0,
69       }).
70     Do(ctx)
71
72   if err != nil {
73     panic(err)
74   }
75 }

To build and run the program, type

$ go build screenshot.go
$ ./screenshot

The user will not see a browser pop up, because chromedp runs in headless (i.e., invisible) mode unless otherwise configured. The following command gets the required library code from GitHub and also compiles and installs it:

$ go get -u github.com/chromedp/chromedp

It takes the compiled program in Listing 1 a few seconds to retrieve the page, depending on your Internet connection and the current server speed; it then saves a PNG image file named screenshot.png to the hard disk. Because the Linux Magazine homepage is several browser screens long, giving users a reason to scroll down and explore, the screenshot in Figure 1 is almost 3000 pixels tall.

Figure 1: A screenshot of the Linux Magazine cover page in a remote-controlled Chrome instance.
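
To check the dimensions of the result without opening an image viewer, a few lines of Go suffice. This sketch reads only the PNG header with the standard image/png package. The helper names pngSize and samplePNG are illustrative, and the blank stand-in image (with arbitrary dimensions) is used only if screenshot.png does not exist in the current directory.

```go
package main

import (
	"bytes"
	"fmt"
	"image"
	"image/png"
	"io"
	"os"
)

// pngSize reads the PNG header and returns the image dimensions
// without decoding the pixel data.
func pngSize(r io.Reader) (width, height int, err error) {
	cfg, err := png.DecodeConfig(r)
	if err != nil {
		return 0, 0, err
	}
	return cfg.Width, cfg.Height, nil
}

// samplePNG builds a blank in-memory PNG as a stand-in for a
// screenshot that has not been taken yet.
func samplePNG(w, h int) *bytes.Buffer {
	var buf bytes.Buffer
	img := image.NewRGBA(image.Rect(0, 0, w, h))
	if err := png.Encode(&buf, img); err != nil {
		panic(err)
	}
	return &buf
}

func main() {
	var src io.Reader
	if f, err := os.Open("screenshot.png"); err == nil {
		defer f.Close()
		src = f
	} else {
		src = samplePNG(400, 3000) // arbitrary stand-in dimensions
	}

	w, h, err := pngSize(src)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d x %d pixels\n", w, h)
}
```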

Listing 1 creates a new chromedp context in line 13 and gives the constructor a standard Go background context, an auxiliary construct for controlling goroutines and the functions they call. Alongside the new context, a context constructor in Go returns a cancel() function. The main program can call this function later to signal to another (maybe deeply nested) part of the program that it is time to clean up, because the doors are closing.

The Tasks structure starting on line 17 defines a set of actions that you want the connected Chrome browser to perform, using the DevTools protocol. The Navigate task starting on line 18 directs the browser to the Linux Magazine website. The second task starting in line 20 is created by the ActionFunc() function, a tool to structure new customized tasks in chromedp. In this case, the task creates a screenshot of the web page displayed in the remote browser using the function CaptureScreenshot() in line 33.
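
The idea of a task list whose entries can be ordinary functions is a common Go pattern. The following sketch rebuilds it with local types named after the roles described above; it is an illustration of the pattern under that assumption, not chromedp's actual code. It also shows a task returning its error instead of calling panic(), so that the runner can report which step failed.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Action is one step to perform. The names mirror the roles in the
// text but are defined locally for illustration.
type Action interface {
	Do(ctx context.Context) error
}

// ActionFunc lets an ordinary function serve as an Action.
type ActionFunc func(ctx context.Context) error

// Do calls the wrapped function.
func (f ActionFunc) Do(ctx context.Context) error { return f(ctx) }

// Tasks is an ordered list of actions.
type Tasks []Action

// Run executes the tasks in order and stops at the first error.
func Run(ctx context.Context, tasks Tasks) error {
	for i, t := range tasks {
		if err := t.Do(ctx); err != nil {
			return fmt.Errorf("task %d: %w", i, err)
		}
	}
	return nil
}

// demo runs three tasks, the second of which fails, and returns the
// names of the executed steps together with the error from Run.
func demo() ([]string, error) {
	var steps []string
	step := func(name string, err error) Action {
		return ActionFunc(func(context.Context) error {
			steps = append(steps, name)
			return err
		})
	}
	tasks := Tasks{
		step("navigate", nil),
		step("screenshot", errors.New("capture failed")),
		step("never reached", nil),
	}
	err := Run(context.Background(), tasks)
	return steps, err
}

func main() {
	steps, err := demo()
	fmt.Println(steps, err)
	// prints: [navigate screenshot] task 1: capture failed
}
```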

Wide Open Spaces

Now the question is how far to open the virtual browser, because this setting determines what you see in the screenshot. Is only a fraction of the web page visible or all of it, including the parts that can only be reached by scrolling? If it's the latter, the screenshot needs to capture everything that the user would see if they had an infinitely tall screen with the browser fully extended.

To capture it all, the GetLayoutMetrics() function in line 23 determines the layout dimensions of the displayed page, and the viewPortFix() function (called in line 31 and defined starting in line 60) uses SetDeviceMetricsOverride() to adjust the dimensions of the invisible browser accordingly. The buf image buffer returned by the CaptureScreenshot() function in line 33 is written to disk in PNG format by WriteFile() in line 53. The sequence of actions, navigating to the page and then taking the screenshot, is processed by the Run() function in line 48.
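
One detail worth a second look is the conversion in line 31: The content dimensions are floating-point values, and int64() simply cuts off the fractional part. The following sketch (viewportSize is a hypothetical helper, and the sample dimensions are made up) rounds up instead, so that a page measuring a fraction of a pixel more than a whole number does not lose its last row or column.

```go
package main

import (
	"fmt"
	"math"
)

// viewportSize converts fractional layout dimensions into whole-pixel
// viewport dimensions, rounding up so no partial row or column is lost.
func viewportSize(contentWidth, contentHeight float64) (w, h int64) {
	return int64(math.Ceil(contentWidth)),
		int64(math.Ceil(contentHeight))
}

func main() {
	// Hypothetical content size, as a layout query might report it.
	cw, ch := 1199.5, 2987.25

	w, h := viewportSize(cw, ch)
	fmt.Println(w, h) // prints: 1200 2988

	// A plain conversion truncates instead.
	fmt.Println(int64(cw), int64(ch)) // prints: 1199 2987
}
```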

The technique of creating screenshots of automatically fetched web pages opens up a number of unheard-of possibilities when testing newly developed web UIs. For example, image recognition can later determine whether the site's various graphic elements are in the right place with different browser sizes, without human test personnel having to click their way through the flow with every release. It could also be used to implement a neat system for archiving websites; in the next century, historians would surely be amused by the advertisements placed on the Linux Magazine homepage in 2020.

Complicating Easy Things

For test purposes, it is quite useful at times to start the remote browser visibly in the foreground instead of hidden in the background. Developers of scraping applications can then see whether the browser is stepping through the flow or getting stuck at some point. Paradoxically, however, setting up foreground mode has become quite complicated since chromedp made background mode the default some time ago. When NewContext() creates a new browser context, it configures the browser to run in background mode deep down in the library's engine compartment, which is inaccessible from outside.

This is why Listing 2 creates a new browser controller with NewExecAllocator() in line 10 and passes it only the NoFirstRun option. Because this option list leaves out the headless setting that chromedp applies by default, the browser runs in the foreground. Back comes a context, but, alas, not one compatible with the context object that chromedp expects and that Run() receives in line 24. Therefore, line 12 creates a compatible context via NewContext() and passes it the previously created Exec context as a parent context. The new chromedp context also has a cancel() function, and the defer statements in lines 13 and 14 are both triggered at the end of the program to neatly shut down the remote-controlled browser.

Listing 2

foreground.go

01 package main
02
03 import (
04   "context"
05   cdp "github.com/chromedp/chromedp"
06   "time"
07 )
08
09 func main() {
10   pctx, pcancel := cdp.NewExecAllocator(
11     context.Background(), cdp.NoFirstRun)
12   ctx, cancel := cdp.NewContext(pctx)
13   defer cancel()
14   defer pcancel()
15
16   tasks := cdp.Tasks{
17     cdp.Navigate(
18       "https://linux-magazin.de"),
19     cdp.Navigate(
20       "http://linux-magazine.com"),
21     cdp.Sleep(5 * time.Second),
22   }
23
24   err := cdp.Run(ctx, tasks)
25   if err != nil {
26     panic(err)
27   }
28 }

Listing 2 only accesses the homepages of the German and English versions of Linux Magazine for this test; it then Sleep()s for five seconds and terminates.
