Common Crawl

FAQ

Article from Issue 200/2017

Download the entire web to kick-start a data science empire.

Q Is this some new swimming stroke that's all the rage?

A Is that really the best guess you can come up with? The Common Crawl project [1] scrapes the web, sucking up as much information as possible, and makes this data available for anyone who wants to use it. Data is released approximately every month and goes back to 2007.

Q They scrape the web for pages that are accessible to the public and make this data available to the public? What exactly is this meant to achieve?

A Loads of information is available on the web, but there's little structure to it. There's no centralized index – no single place you can go to find out what is on this vast area of cyberspace we call the web. All the average person can do is either click through links and slowly traverse the network of information, or rely on a commercial search engine such as Google or Bing to help make sense of what's out there.

The Common Crawl project puts the web into a machine-readable format that allows anyone – whether they're an individual, a startup, or a multinational company – to analyze the information on the web for whatever purpose they want (Figure 1).

Figure 1: Head to the Common Crawl website to find out how to get started with the dataset.

Q So, Common Crawl makes it easy to set up my own competitor to Google?

A Not really. Real-time web searching is hard, and gathering the data needed for the search is one of the simplest parts of the challenge. What Common Crawl does let you do is analyze the information on the web in ways that simply aren't possible with existing commercial tools. Suppose, for example, Graham Morrison, Mike Saunders, Andrew Gregory, and I are in a pub and decide that the person with the most mentions on the web should buy the next round of drinks. How should we go about that? We can search for our names on Google (adding the word Linux to try to weed out other people of the same name), and the results page tells us that Mike Saunders has "About 15,500 results." That's more than the rest of us, but it doesn't sound like a very scientific answer. Surely Google knows the actual number and doesn't have to hedge. What's more, that figure only counts the number of pages that include Mike's name, not the actual number of mentions of his name. This information simply isn't available from Google.

To persuade Mike that he has the most mentions and therefore has to buy the next round (Mike doesn't buy a round without overwhelming evidence), we need to do our own scanning of the web. To do that, we need a dump of the web in a machine-readable format. In other words, we need Common Crawl. With a simple bit of code, we can scan through the data and analyze it in any way we want, including counting the number of times our names are mentioned.
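As a rough idea of what that "simple bit of code" might look like, here is a minimal Python sketch that counts every occurrence of each name in a block of text. The sample text, the NAMES list, and the count_mentions function are all made up for illustration; in practice, the text would come from pages in the crawl.

```python
import re
from collections import Counter

# Hypothetical sample text standing in for page content taken from a crawl.
SAMPLE_PAGE = (
    "Mike Saunders wrote about Linux. Graham Morrison replied, "
    "and Mike Saunders answered again."
)

# The names to look for (illustrative list).
NAMES = ["Graham Morrison", "Mike Saunders", "Andrew Gregory"]


def count_mentions(text, names):
    """Count every occurrence of each name, not just whether it appears."""
    counts = Counter()
    for name in names:
        # \b stops matches inside longer words; re.escape guards against
        # regex metacharacters in a name.
        pattern = re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
        counts[name] = len(pattern.findall(text))
    return counts


# Print the names from most to least mentioned.
for name, total in count_mentions(SAMPLE_PAGE, NAMES).most_common():
    print(name, total)
```

Unlike a search engine's page count, this counts each mention separately, so a page that names Mike twice adds two to his total.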

Obviously, this is an esoteric example, but it's easy to see situations where the information could be more useful. For example, you could use a sentiment analyzer to see how the writing on the web views a particular topic (even using the monthly dumps to see how this has changed over time), or analyze how different websites view different topics, or … well, you get the idea. It gives you the ability to analyze the information on the web without setting up your own crawling infrastructure.

Q You've made a lot of sweeping statements there about what you can do. I assume you need some technical skills, however. How hard is it to run an analysis on the Common Crawl dataset?

A Surprisingly easy. The data is all stored in WARC format, which was developed for the Internet Archive, but it's a plain-text format that's quite easy to process. There's a library for reading WARC files in Python, but if you prefer another language, you shouldn't have too much trouble getting the data in. Beyond that, it depends on what sort of processing you intend to do. Searching for particular terms is easy; building neural networks to identify complex features of text is harder. The point is, though, that with this data you can focus on the processing and not worry about getting hold of the data in the first place.
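To show how little there is to the format, here is a minimal Python sketch of a WARC reader that uses only the standard library. It assumes the basic record layout (a WARC/ version line, "Name: value" header lines, a blank line, then exactly Content-Length bytes of content) and runs against two hand-made records. The read_warc_records and make_record functions are illustrative names, not part of any existing library, and real crawl files are typically gzip-compressed, so you would need to decompress them first (for example, with Python's gzip module). For serious work, a dedicated WARC library is the safer choice.

```python
import io


def read_warc_records(stream):
    """Yield (headers, body) pairs from an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # skip the blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        headers = {}
        while True:
            line = stream.readline()
            if not line.strip():
                break  # a blank line ends the header block
            # Split on the first colon only, because values such as URLs
            # contain colons of their own.
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The content block is exactly Content-Length bytes long.
        body = stream.read(int(headers["Content-Length"]))
        yield headers, body


def make_record(uri, payload):
    """Build a tiny hand-made WARC record for demonstration purposes."""
    lines = [
        "WARC/1.0",
        "WARC-Type: response",
        "WARC-Target-URI: " + uri,
        "Content-Length: " + str(len(payload)),
        "",
        "",
    ]
    return "\r\n".join(lines).encode("utf-8") + payload + b"\r\n\r\n"


# Two made-up records standing in for a real crawl file.
sample = make_record("http://example.com/a", b"hello") + make_record(
    "http://example.com/b", b"Mike Saunders was here"
)

for headers, body in read_warc_records(io.BytesIO(sample)):
    print(headers["WARC-Target-URI"], len(body))
```

Because the reader is a generator, it hands back one record at a time and never needs to hold the whole file in memory.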

Q So there's nothing hard about processing the data at all?

A Well, there is one thing…

Q Go on.

A The full data dump is 250TB.

Q 250 terabytes! How on earth am I supposed to download that, let alone process it?

A Well, that's the uncompressed size, so downloading it is a little easier. It's designed for processing in chunks so you can download a bit, process that, then move on to the next part. On a single computer, this could take a while, but you can speed up matters by spreading the load across machines using something like Hadoop's MapReduce.
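The chunk-at-a-time approach has the same shape as a MapReduce job, which the following Python sketch imitates on a single machine. It is only an illustration of the pattern, not Hadoop itself: the CHUNKS list stands in for pieces of the crawl, and map_chunk, merge_counts, and count_all are made-up names.

```python
from collections import Counter
from functools import reduce

# Hypothetical stand-ins for chunks of crawl text.
CHUNKS = [
    "Mike Saunders likes Linux. Mike Saunders likes tea.",
    "Graham Morrison likes synths.",
    "Andrew Gregory and Mike Saunders like beer.",
]

NAMES = ["Graham Morrison", "Mike Saunders", "Andrew Gregory"]


def map_chunk(chunk):
    """Map step: count name mentions within a single chunk."""
    return Counter({name: chunk.count(name) for name in NAMES})


def merge_counts(left, right):
    """Reduce step: combine the partial counts from two chunks."""
    return left + right


def count_all(chunks):
    """Process the chunks one at a time and fold the results together."""
    return reduce(merge_counts, map(map_chunk, chunks), Counter())


totals = count_all(CHUNKS)
for name in NAMES:
    print(name, totals[name])
```

Each call to map_chunk depends only on its own chunk, which is what makes it possible to hand different chunks to different machines and merge the partial counts afterwards.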

Q A minute ago you were telling me this was easy, and now you're telling me that I need to use Hadoop! This doesn't sound easy at all.

A Well, you don't need to use Hadoop, but it can help. It also doesn't need to be hard. The Common Crawl team has put together a tool for launching MapReduce jobs that automatically links in the data. You just have to fill in the details and provide the Python code to process the data. You can see the examples online [2].

Q Well, yes, but you've neatly stepped over the part where you have to set up the Hadoop cluster first.

A Yes, it does need a Hadoop cluster, but this doesn't have to be a pain to set up either. If you don't have machines to run it on, you can spin up machines in the cloud and run it there. The Common Crawl data is hosted in Amazon's S3 storage, and you can link it to their cloud machines without paying for bandwidth. Using Elastic MapReduce and the cc-mrjob, you can automatically spin up a Hadoop cluster using cheap spot instances and process the data using just a single command.

Q OK, that doesn't sound too bad, but is there anything I can do with this data without creating clusters of machines to download terabytes of data?

A As it happens, yes. The Web Data Commons project [3] analyzes the Common Crawl dataset and pulls out some useful information (Figure 2). The result is smaller datasets that are more manageable but don't have as much data. For example, there's a dataset of all locations linked to web pages that's 700MB, all calendar events on the web (2GB), all reviews (3GB), and more. Head to the Web Data Commons website to download the files.

Figure 2: Interested in the data but don't want to deal with the full 250TB of Common Crawl? The Web Data Commons project picks out some highlights.

Infos

  1. Common Crawl: http://www.commoncrawl.org
  2. MapReduce examples: https://github.com/commoncrawl/cc-mrjob
  3. Web Data Commons: http://webdatacommons.org/
