Scraping the web for data

Text Scraper

Listing 1 shows part of a text scraper that I used for a micro-encyclopedia project [2] in HTML and ePUB format, whose entries consist of the first two full paragraphs of the corresponding Wikipedia pages. Listing 1 retrieves just those two paragraphs, with all their HTML formatting, from the individual Wikipedia pages.

Listing 1

Scraping Wikipedia Paragraphs

 

I grabbed the code for lines 1 to 28 from the web, as it seems to be the standard, well-tested header of most Beautiful Soup-based scraping scripts. I only added verify=False (line 10) to skip verification of SSL certificates, because I had to scrape websites that did not have them.

The code loads Beautiful Soup 4 and the other libraries (requests and contextlib) needed to download web pages and handle errors. The main function simple_get in line 8 tries to download the web page from the URL passed as its argument, using two other functions, is_good_response and log_error. If the download's status code is 200, the page retrieved is not empty and is in HTML format (lines 23 to 25), is_good_response is true, and simple_get returns the whole page (line 12). Otherwise, it calls log_error to report the problem (lines 16 to 18).
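Because Listing 1 is not reproduced here, the following is a sketch of what such a header could look like, based only on the description above. The function names (simple_get, is_good_response, log_error) come from the text; everything else is an assumption:

```python
from contextlib import closing

import requests
from bs4 import BeautifulSoup  # imported here so the rest of the script can use it


def simple_get(url):
    """Try to download the page at url, returning its raw content on success."""
    try:
        # verify=False skips SSL certificate checks, as described in the text
        with closing(requests.get(url, stream=True, verify=False)) as resp:
            if is_good_response(resp):
                return resp.content
            return None
    except requests.exceptions.RequestException as e:
        log_error('Error during request to {0}: {1}'.format(url, str(e)))
        return None


def is_good_response(resp):
    """True if resp is a successful (status 200) response carrying HTML."""
    content_type = resp.headers.get('Content-Type', '').lower()
    return resp.status_code == 200 and 'html' in content_type


def log_error(e):
    """Report download problems; printing is the simplest possible choice."""
    print(e)
```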

My original code for the text scraper reads a full list of Wikipedia pages from a file. In this simplified version (Listing 1), I just pass one URL to simple_get in line 30, saving its result in the wikipedia_page variable. Line 31 calls BeautifulSoup, telling it to parse the whole page with its HTML parser and save a copy in a structured format made to order for scraping, inside the content variable. It is this data structure that makes scraping feasible even for occasional programmers. The find_all construct in line 34, in fact, creates a list of all the page elements that are paragraphs (which are marked with the <p> tag), and the for loop in the same line scans all of them.

I want the first two paragraphs that actually contain text, but Wikipedia pages sometimes have empty paragraphs in unpredictable places. By looking at the page's source code, I discovered that those empty paragraphs have a CSS class mw-empty-elt. If the current paragraph has a class attribute with that value, line 35 detects it, and line 36 moves the loop to the next paragraph. Otherwise, the full HTML code of the paragraph is printed, and the counter is incremented. As soon as the loop has found two non-empty paragraphs, the n counter becomes equal to 3, and the whole script stops in line 39. Of course, I could have detected empty paragraphs by checking the length of their content, but I wanted to show you how to check the value of CSS classes, because these parameters, together with id elements that you can process in the same way, are usually the best markers for navigating HTML code.
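The paragraph loop just described can be sketched as follows. To keep the example self-contained, it parses an inline HTML snippet instead of a page fetched from the network; the counter logic mirrors the description above, with variable names of my own choosing:

```python
from bs4 import BeautifulSoup

# Inline stand-in for the page that simple_get() would download
# (an assumption, so the example runs without network access)
wikipedia_page = """
<p class="mw-empty-elt"></p>
<p>First real paragraph.</p>
<p class="mw-empty-elt"></p>
<p>Second real paragraph.</p>
<p>Third paragraph, never printed.</p>
"""

content = BeautifulSoup(wikipedia_page, 'html.parser')

kept = []
n = 1
for paragraph in content.find_all('p'):
    # Empty placeholder paragraphs carry the CSS class mw-empty-elt
    if 'mw-empty-elt' in paragraph.get('class', []):
        continue
    print(paragraph)              # the paragraph's full HTML
    kept.append(str(paragraph))
    n = n + 1
    if n == 3:                    # two non-empty paragraphs found: stop
        break
```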

Image Scraper

You can also use a web scraper for images. I like to collect copies of ancient maps from areas around Rome. Listing 2 shows how I scraped map images from a local university's website (Figure 1) and saved them on my computer (Figure 2). The complete image scraper script includes lines 1-28 from Listing 1, but Listing 2 omits those lines for brevity. The first three lines load three other Python libraries: re for the regular expression of line 8, urllib to save web pages locally, and os to handle paths on the local filesystem. Lines 5 and 6 are identical to lines 30 and 31 in Listing 1: They grab the web page containing the links to all the images and save its BeautifulSoup representation in the content variable. Even here, the find_all function creates a list of only the elements I want to scrape, for the loop in lines 8 to 12.

Listing 2

Automatic Download of Ancient Maps

 

Figure 1: Map images in their original location on the web (note the image title).
Figure 2: The maps, including image titles, from Figure 1 downloaded to my computer using a simple web scraping script.

In this case, those elements are the ones that contain hyperlinks to JPEG images. Such elements are recognizable because they have an href attribute with a value ending in the .jpg extension. The values of those href attributes, however, are not complete URLs that you can directly download. To get the complete URL, line 9 prepends the website address to each image_link's href value and saves the result into image_url. Line 10 then uses the os.path.basename function to set the name that the image's local copy will have on my computer. Now line 11 can log to standard output the file name and the description of the image that is inside its title attribute (Figure 1). Finally, line 12 downloads the current image from the website and saves it into the folder imagedir, with the name built in line 10.
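A sketch of that loop, again working on an inline HTML stand-in (the base URL and link values are made up for illustration), could look like this; the actual download call is shown in a comment so the example runs offline:

```python
import os.path
import re
import urllib.request

from bs4 import BeautifulSoup

site = 'https://www.example-university.it'   # assumption: stand-in base URL
imagedir = 'maps'

# Inline stand-in for the index page that simple_get() would return
index_page = """
<a href="/maps/rome_1748.jpg" title="Nolli map of Rome, 1748">map</a>
<a href="/about.html" title="About">about</a>
<a href="/maps/latium_1620.jpg" title="Latium, 1620">map</a>
"""
content = BeautifulSoup(index_page, 'html.parser')

downloads = []
# Keep only the anchors whose href value ends with .jpg
for image_link in content.find_all('a', href=re.compile(r'\.jpg$')):
    image_url = site + image_link['href']              # build the complete URL
    image_name = os.path.basename(image_link['href'])  # local file name
    print(image_name, image_link.get('title', ''))
    downloads.append((image_url, os.path.join(imagedir, image_name)))
    # The real script would save each image locally, e.g.:
    # urllib.request.urlretrieve(image_url, os.path.join(imagedir, image_name))
```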

A Wikipedia City Scraper

For a more complex project, I scraped images and text from two different websites and merged them. I needed basic statistics and contact data for all Italian cities (almost 8,000 of them). I fetched most of the data from the Italian Wikipedia, and the rest from another source.

First, I needed to get all the URLs for the pages I wanted to scrape. The Italian Wikipedia has one separate page for each city, but not one single page with a complete list of almost 8,000 entries. It does provide several types of partial indexes; I chose the alphabetical one shown in Figure 3 as it was the easiest to use for my needs.

Figure 3: The starting point of a city-scraping project: a partial, alphabetical index of Italian cities.

Clicking on any of the links in Figure 3 would open a page as shown in Figure 4, with links to all the city pages for the corresponding letter(s). Consequently, I had to scrape the master index first to get the URLs of the approximately 20 sub-indexes, then scrape each sub-index for the links to the individual city pages, and finally download and parse the individual city pages.

Figure 4: The partial indexes, linked from the page shown in Figure 3, contain all the URLs to the pages of each city.

Additionally, I needed to know what to scrape, exactly, from each of those individual pages (Figure 5). I wanted to scrape the images, when available, of the city flag (Bandiera), crest (Stemma), and view (Panorama). I also needed some textual data, namely the Province (Provincia) of each city, its mayor (Sindaco), elevation (Altitudine), area (Superficie), and residents (Abitanti), plus three other categories (not shown in Figure 5): earthquake risk class (Cl. sismica), patron saint (Patrono), and official website (Sito istituzionale).

Figure 5: The Wikipedia page of a small Italian town, with the fields to be scraped highlighted.

Looking at Figure 5, you can see that all the data is available as a cell in the first table of each Wikipedia web page. That table's source code is one messy chunk of HTML. Listing 3 shows just part of the HTML code for Panorama (image) and Altitudine (text).

Listing 3

HTML Source Code from

 

Listing 4 shows the script that helped me grab this mass of data (lines 1-28 as shown in Listing 1 are omitted for brevity). Here, I also use the time library (line 1) to make the script pause 1.5 seconds (line 37) after downloading each page to reduce bandwidth consumption.

Listing 4

Wikipedia City Scraper

 

Line 3 is an array containing all the fields in the table that have the same formatting and can therefore be fetched with one common procedure. Using the methods already described, lines 5 and 6 dump Figure 3's master index inside the auxiliary content variable. Line 8 saves inside the array li (short for "list index") all the elements of the page that are anchors (a) to hyperlinks, placed inside a list element (li) of any unordered list (ul). Looking at that index page, I knew that the only links I wanted were those whose text had the form "Comuni d'Italia <SOME LETTER>" ("Comuni d'Italia" means "Municipalities of Italy"). Consequently, the first thing that the loop starting in line 9 does is check if the current element contains that string (line 10). If it does, line 11 builds the full URL of that sub-index. After the log message in line 12, lines 13 and 14 fetch and parse that URL (which is a page formatted like Figure 4) in the usual way.

Figure 4 shows that the links to the individual pages for each city are contained in the first column of the first table of every sub-index page. The rows of that table are loaded and scanned one by one by the loop starting in line 16. The absolute URL of each page is built in line 17 in the same way as line 11, by prepending the absolute URL to the link found in the first cell of the row. Thanks to BeautifulSoup, the notation row.find_all('td')[0].a['href'] is all that is needed to achieve this. The same technique, applied in line 18 to the text attribute of the first cell of each row, returns the name of the city. In the first cell of Figure 4, city_name would have the value Abano Terme and city_url the value highlighted in the bottom left corner of Figure 4.
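These two steps, filtering the master index and then walking the rows of a sub-index table, can be sketched like this. The inline HTML fragments are stand-ins for the real pages, so the only things taken from the text are the "Comuni d'Italia" filter and the row.find_all('td')[0].a['href'] navigation:

```python
from bs4 import BeautifulSoup

wikipedia = 'https://it.wikipedia.org'   # base used to build absolute URLs

# Inline stand-in for the master index page (Figure 3)
master_index = """
<ul>
<li><a href="/wiki/Comuni_A">Comuni d'Italia (A)</a></li>
<li><a href="/wiki/Portale">Something else</a></li>
</ul>
"""
content = BeautifulSoup(master_index, 'html.parser')

sub_index_urls = []
for anchor in content.select('ul li a'):
    if "Comuni d'Italia" in anchor.text:       # keep only the sub-indexes
        sub_index_urls.append(wikipedia + anchor['href'])

# Inline stand-in for one sub-index page (Figure 4): the city link sits
# in the first cell of each table row
sub_index = """
<table>
<tr><td><a href="/wiki/Abano_Terme">Abano Terme</a></td><td>PD</td></tr>
<tr><td><a href="/wiki/Abbadia">Abbadia</a></td><td>LC</td></tr>
</table>
"""
rows = BeautifulSoup(sub_index, 'html.parser').find_all('tr')
cities = []
for row in rows:
    city_url = wikipedia + row.find_all('td')[0].a['href']
    city_name = row.find_all('td')[0].a.text
    cities.append((city_name, city_url))
```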

Once it knows the city page's URL, the script logs what it found (line 19) and downloads that page into another content variable (lines 21 and 22). I can reuse the same variable name (content) in each loop without errors, because each new assignment simply replaces the previous value.

Line 23 starts creating a new line of what will become the final CSV file with all the data: It writes the city name and the pipe character (|) without a newline (this is the meaning of the end="" statement). The loop in lines 25 to 27 retrieves from the HTML table all the variables listed in the fields array of line 3, and highlighted in Figure 5, by using a very powerful Python construct, lambda functions.

Lambda functions are, in their simplest form, short functions without an explicit name, written inline with the code that needs them. In practice, line 26 uses a lambda function to search and load into a label variable all the cell headers of the page whose text is equal to one of the fields of line 3. Only when that happens is the label variable defined, and this makes the script execute line 27: It loads the table's next data cell (located with find_next), which contains the field value, strips it of leading and trailing whitespace, and prints it to standard output, followed by another pipe character.
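A minimal sketch of this field-extraction pattern, using an inline stand-in for the infobox table of Figure 5 (the field names are real, the cell values invented):

```python
from bs4 import BeautifulSoup

fields = ['Altitudine', 'Superficie', 'Abitanti']   # a subset of line 3's array

# Inline stand-in for the first table of a city page (Figure 5)
city_page = """
<table>
<tr><th>Altitudine</th><td> 18 m s.l.m. </td></tr>
<tr><th>Superficie</th><td>21,57 km2</td></tr>
<tr><th>Abitanti</th><td>19 062</td></tr>
</table>
"""
content = BeautifulSoup(city_page, 'html.parser')

record = []
for field in fields:
    # The lambda keeps only header cells whose text matches the current field
    label = content.find(lambda tag: tag.name == 'th' and
                         tag.get_text(strip=True) == field)
    if label:
        # The next data cell after the header holds the field's value
        value = label.find_next('td').get_text(strip=True)
        print(value, end='|')       # pipe-separated, as in the final CSV
        record.append(value)
```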

Lines 28 and 29 use exactly the same technique to retrieve the URL of the city's institutional website (href field in line 29). I needed separate code for this, because this time I extract a link, not the string associated with it.

The last loop of the script applies yet another version of the same basic trick to grab the URLs of the view, crest, and flag images. Lines 31 and 32 skip the provinces' flags and crests, because I only want those of cities. Lines 33 to 35 print the link found inside the HTML image tags, but only for images whose alternate description (alt) includes the words "Veduta", "Stemma", or "Bandiera". Line 36 closes the record for the current city, by adding a newline (\n). When I ran this script and stripped out all the LOG lines, I got a spreadsheet with rows as shown in Listing 5 (for readability, each field is on a separate line and the URLs are shortened).
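The image-grabbing loop could be sketched as follows; the alt texts and src values are invented stand-ins, and I approximate the "skip the provinces" check with a simple substring test, which is an assumption about how the real script does it:

```python
from bs4 import BeautifulSoup

# Inline stand-in for the images found on a city page; the alternate
# descriptions contain the keywords the script looks for
city_page = """
<img alt="Panorama di Abano Terme" src="//upload.example/abano_pano.jpg">
<img alt="Veduta di Abano Terme" src="//upload.example/abano_view.jpg">
<img alt="Stemma di Abano Terme" src="//upload.example/abano_crest.png">
<img alt="Bandiera della Provincia di Padova" src="//upload.example/pd_flag.png">
<img alt="Bandiera di Abano Terme" src="//upload.example/abano_flag.png">
"""
content = BeautifulSoup(city_page, 'html.parser')

keywords = ('Veduta', 'Stemma', 'Bandiera')
image_urls = []
for img in content.find_all('img'):
    alt = img.get('alt', '')
    if 'Provincia' in alt:          # skip the provinces' flags and crests
        continue
    if any(word in alt for word in keywords):
        print(img['src'], end='|')
        image_urls.append(img['src'])
print()                              # the newline closes the city's record
```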

Listing 5

Output of Wikipedia City Scraper

 


