Scraping the web for data

Data Harvesting

© Lead Image © faithie, 123RF.com

Article from Issue 233/2020

Web scraping lets you automatically download and extract data from websites to build your own database. With a simple scraping script, you can harvest information from the web.

If you are looking to collect data from the Internet for a personal database, your first stop is often a Google search. However, a search for mortgage rates can (in theory) return dozens of pages full of relevant images and data, as well as a lot of irrelevant content. You could visit every web page pulled up by your search and cut and paste the relevant data into your database. Or you could use a web scraper to automatically download and extract raw data from the web pages and reformat it into a table, graph, or spreadsheet on your computer.

Not just big data professionals, but also small business owners, teachers, students, shoppers, or just curious people can use web scraping to do all manner of tasks from researching a PhD thesis to creating a database of local doctors to comparing prices for online shopping. Unless you need to do really complicated stuff with super-optimized performance, web scraping is relatively easy. In this article, I'll show you how web scraping works with some practical examples that use the open source tool Beautiful Soup.

Caveats

Web scraping does have its limits. First, you have to start off with a well-crafted search engine query; web scraping can't replace the initial search. To protect their business and observe legal constraints, search engines deploy anti-scraping features; overcoming them is not worth the time of the occasional web scraper. Instead, web scraping shines (and is irreplaceable) after you have completed your web search.

Web page complexity, which has increased over the past 10 to 15 years, also affects web scraping. Due to this complexity, determining the relevant parts of a web page's source code and how to process it has become more time consuming despite the great progress made by web scraping tools.

Dynamic content poses another problem for web scraping. Today, most pages refresh continuously, change layout from one moment to the next, and are customized for each visitor. This makes the scraping code more complicated and sometimes leaves you without the certainty that the scraper will extract exactly what you would see in your browser. This is particularly problematic for online shopping. Not only does the price change frequently, but it also depends on many independent factors, from shipping costs to your buyer profile, shopping history, or preferred payment method, all of which a scraping script cannot see. Consequently, the best deal found by an ordinary scraping script may be quite different from what you would be offered when clicking on the corresponding link. (You could close that gap, but only by spending so much time and effort on tweaking the web scraper that you wouldn't have time left to enjoy your purchases!)

Additionally, many web pages only display properly after some JavaScript code has been downloaded and run or after some interaction with the user. Others, like Tumblr blogs, run an "infinite scroll" function. In all of these cases, a scraper may start parsing the code before it is ready for viewing, thus failing to deliver what you would see in your browser.

Changing HTML tags is yet another issue. Scraping works by recognizing certain HTML tags, with certain attribute values (such as id or class names), in the web page you want to scrape. Those names may change after a software upgrade or the adoption of a different graphic theme, and your scraper script will then fail until you update its code accordingly.

Scraping can consume a lot of bandwidth, which may create real problems for the website you are scraping. To remedy this, throttle your scrapers so that they pause between requests, and scrape only when there is no alternative (i.e., when webmasters don't provide direct access to the data you want). That way, everybody will be happy.
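As a minimal sketch of a deliberately slow scraper, the following Python code pauses between requests. The function names (fetch, fetch_politely), the five-second default delay, and the User-Agent string are illustrative assumptions, not values taken from any particular site's policy; pick a delay that suits the site you are scraping.

```python
import time
from urllib.request import Request, urlopen


def fetch(url, timeout=30):
    """Download one page and return its body as bytes."""
    # The User-Agent value is a made-up placeholder; use one that identifies you.
    request = Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urlopen(request, timeout=timeout) as response:
        return response.read()


def fetch_politely(urls, delay=5.0, fetcher=fetch, sleep=time.sleep):
    """Download each URL in turn, pausing `delay` seconds between requests."""
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # wait before every request except the first
        pages[url] = fetcher(url)
    return pages
```

Calling fetch_politely() with a list of URLs returns a dictionary that maps each URL to its downloaded content. The fetcher and sleep parameters exist only so that the loop can be tried out without touching the network.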

Finally, a website's business needs, as well as copyright and other legal issues, stand in the way of web scraping. Many webmasters try their best to block automated scraping of their content. This article does not attempt to address the copyright issues related to web scraping, which can vary with the site requirements and jurisdiction.

Web Scraping Steps

In spite of these caveats, web scraping remains an immensely useful activity (and if you ask me, a really fun one) in many real world cases. In practice, every web scraping project goes through the same five steps: discovery, page analysis, automatic download, data extraction, and data archival.

In the discovery phase, you search for the pages you want to scrape via a search engine or by simply looking at a website's home page.

During page analysis, you study the HTML code of your selected pages to determine the location of the desired elements and how a scraping script might recognize them. Most of the time, HTML tags' id and class attributes are the easiest to detect and use for web scraping, but that is not always the case. Only visual inspection can confirm this and give you the names of those attributes. You can get a web page's HTML code by saving the complete web page on your computer or by right-clicking in your browser and selecting View Page Source (or View Selection Source for a selected paragraph).
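Visual inspection remains necessary, but a small helper can speed it up by listing which id and class values a saved page contains. The sketch below uses html.parser from the Python standard library rather than Beautiful Soup; the class name AttributeInventory, the inventory() function, and the sample HTML are all made up for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class AttributeInventory(HTMLParser):
    """Count the id and class values that appear in an HTML document."""

    def __init__(self):
        super().__init__()
        self.ids = Counter()
        self.classes = Counter()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if not value:
                continue  # skip attributes without a value
            if name == "id":
                self.ids[value] += 1
            elif name == "class":
                # A class attribute may hold several space-separated names.
                self.classes.update(value.split())


def inventory(html):
    """Return two Counters: id values and class names found in `html`."""
    parser = AttributeInventory()
    parser.feed(html)
    parser.close()
    return parser.ids, parser.classes


# Made-up sample page fragment; the figures are placeholders, not real rates.
SAMPLE = """
<div id="rates">
  <p class="rate fixed">30-year: 3.1%</p>
  <p class="rate variable">5/1 ARM: 2.8%</p>
</div>
"""
```

Running inventory() on a saved page shows which names occur often enough to be plausible hooks for a scraper, which you can then confirm in the page source.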

The final three steps (automatic download, data extraction, and data archival) involve writing a script that will actually download the page(s), find the right data inside them, and write that data to an external file for later processing. The most common format, which I use in two of my examples, is CSV (comma-separated values), a plain text file with one record per line and fields separated by commas or other predefined characters (I prefer pipes). JSON (JavaScript Object Notation) is another popular choice that is better suited to nested or hierarchical data.
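To illustrate the archival step, here is a minimal sketch that serializes the same records as pipe-separated CSV and as JSON with the Python standard library. The records and the function names (to_pipe_csv, to_json) are invented for this example.

```python
import csv
import io
import json

# Made-up records standing in for scraped data.
records = [
    {"name": "Dr. A. Example", "city": "Springfield"},
    {"name": "Dr. B. Sample", "city": "Shelbyville"},
]


def to_pipe_csv(rows, fieldnames):
    """Return the rows as CSV text with | as the field separator."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, delimiter="|", lineterminator="\n"
    )
    writer.writeheader()  # first line holds the column names
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(rows):
    """Return the rows as an indented JSON array."""
    return json.dumps(rows, indent=2)
```

Both functions return text, so writing the result to disk is a matter of opening a file and calling write(). The csv module quotes any field that happens to contain the separator, which is one reason to use it instead of joining strings by hand.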

Remember to keep both your code and your output data as simple as possible. For most web scraping activities, you will only grab the data once, but may then spend days or months analyzing it. Consequently, it doesn't make sense to optimize the scraping code. For instance, a program that will run once while you sleep (even if it takes a whole night to finish) isn't worth spending two hours of your time to optimize. In terms of your output data, it's difficult to know in advance all the possible ways you may want to process it. Therefore, just make sure that the extracted data is correct and laid out in a structured way when you save it. You can always reformat the data later, if necessary.

Beautiful Soup

Currently, the most popular open source web scraping tool is Beautiful Soup [1], a Python library for extracting data out of HTML and XML files. It has a large user community and very good documentation. If Beautiful Soup 4 is not available as a binary package for your Linux distribution, you can install it with the Python package manager, pip. On Ubuntu and other Apt-based distributions, you can install pip and Beautiful Soup with these two commands:

sudo apt-get install python3-pip
pip3 install requests beautifulsoup4

With Beautiful Soup, you can create scraping scripts for simple text and image scraping or for more complex projects.
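As a first taste, the following sketch parses a small HTML string and pulls out elements by id and by class. The HTML snippet, its id and class names, and the numbers in it are made up for illustration; a real script would pass in the downloaded page instead.

```python
from bs4 import BeautifulSoup

# Made-up page fragment; the values are placeholders, not real data.
HTML = """
<html><body>
  <h1 id="title">Sample rates</h1>
  <ul>
    <li class="rate">3.1</li>
    <li class="rate">2.8</li>
    <li class="note">Updated daily</li>
  </ul>
</body></html>
"""

# "html.parser" is the parser bundled with Python, so nothing else is needed.
soup = BeautifulSoup(HTML, "html.parser")

# find() returns the first matching element, here the one with id="title".
title = soup.find(id="title").get_text(strip=True)

# find_all() returns every match; class_ filters on the class attribute.
rates = [float(li.get_text(strip=True)) for li in soup.find_all("li", class_="rate")]
```

Note that find() returns None when nothing matches, so a script that should survive layout changes needs to check the result before calling get_text().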
