Common Crawl
FAQ
Download the entire web to kick-start a data science empire.
Q Is this some new swimming stroke that's all the rage?
A Is that really the best guess you can come up with? The Common Crawl project [1] scrapes the web, sucking up as much information as possible, and makes this data available for anyone who wants to use it. Data is released approximately every month and goes back to 2007.
Q They scrape the web for pages that are accessible to the public and make this data available to the public? What exactly is this meant to achieve?
A Loads of information is available on the web, but there's little structure to it. There's no centralized index – no single place you can go to find out what is on this vast area of cyberspace we call the web. All the average person can do is either click through links and slowly traverse the network of information, or rely on a commercial search engine such as Google or Bing to help make sense of what's out there.
The Common Crawl project puts the web into a machine-readable format that allows anyone – whether they're an individual, a startup, or a multinational company – to analyze the information on the web for whatever purpose they want (Figure 1).
Q So, Common Crawl makes it easy to set up my own competitor to Google?
A Not really. Real-time web searching is hard, and gathering the data needed for the search is one of the simplest parts of the challenge. Common Crawl lets you analyze the information on the web in ways that simply aren't possible with existing commercial tools. Suppose, for example, Graham Morrison, Mike Saunders, Andrew Gregory, and I are in a pub and decide that the person with the most mentions on the web should buy the next round of drinks. How should we go about that? We can search for our names on Google (adding the word Linux to try to weed out other people with the same name), and the results page tells us that Mike Saunders has "About 15,500 results." That's more than the rest of us, but it doesn't sound like a very scientific answer. Surely Google knows the actual number and doesn't have to hedge. What's more, that figure only counts the number of pages that include a result for Mike, not the actual number of mentions of his name. This information simply isn't available from Google.
To persuade Mike that he has the most mentions and therefore has to buy the next round (Mike doesn't buy a round without overwhelming evidence), we need to do our own scanning of the web. To do that, we need a dump of the web in a machine-readable format. In other words, we need Common Crawl. With a simple bit of code, we can scan through the data and analyze it in any way we want, including counting the number of times our names are mentioned.
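For the pub bet, a minimal sketch might look like the following. It assumes the plain text of each crawled page has already been extracted (reading the archives themselves is covered below), and the names and sample data are purely illustrative.

```python
from collections import Counter

# Names to tally; purely illustrative.
NAMES = ["Graham Morrison", "Mike Saunders", "Andrew Gregory"]

def count_mentions(pages):
    """Count how often each name appears across an iterable of page texts."""
    totals = Counter()
    for text in pages:
        for name in NAMES:
            totals[name] += text.count(name)
    return totals

# Stand-in data; in practice, the pages would come from the Common Crawl archives.
sample_pages = ["Mike Saunders wrote about Linux. Mike Saunders strikes again."]
print(count_mentions(sample_pages))
```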
Obviously, this is an esoteric example, but it's easy to see situations where the information could be more useful. For example, you could use a sentiment analyzer to see how the writing on the web views a particular topic (even using the monthly dumps to see how this has changed over time), or analyze how different websites view different topics, or … well, you get the idea. It gives you the ability to analyze the information on the web without setting up your own crawling infrastructure.
Q You've made a lot of sweeping statements there about what you can do. I assume you need some technical skills, however. How hard is it to run an analysis on the Common Crawl dataset?
A Surprisingly easy. The data is all stored in WARC format, which was developed for the Internet Archive, but it's a plain-text format that's quite easy to process. There's a library for reading WARC files in Python, but if you prefer another language, you shouldn't have too much trouble getting the data in. Beyond that, it depends on what sort of processing you intend to do. Searching for particular terms is easy; building neural networks to identify complex features of text is harder. The point is, though, that with this data you can focus on the processing and not worry about getting hold of the data in the first place.
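As an illustration, here is a minimal sketch using warcio, one of the Python libraries that can read WARC files (which library you pick is up to you, and the file name below is just a placeholder for a segment you have downloaded):

```python
from warcio.archiveiterator import ArchiveIterator

# "segment.warc.gz" is a placeholder for any downloaded Common Crawl segment.
with open("segment.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        # 'response' records carry the fetched pages; other record types
        # hold the requests and crawl metadata.
        if record.rec_type == "response":
            url = record.rec_headers.get_header("WARC-Target-URI")
            body = record.content_stream().read()  # raw bytes, usually HTML
            print(url, len(body))
```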
Q So there's nothing hard about processing the data at all?
A Well, there is one thing…
Q Go on.
A The full data dump is 250TB.
Q 250 terabytes! How on earth am I supposed to download that, let alone process it?
A Well, that's the uncompressed size, so downloading it is a little easier. It's designed for processing in chunks, so you can download a bit, process that, then move on to the next part. On a single computer, this could take a while, but you can speed things up by spreading the load across machines using something like Hadoop's MapReduce.
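For example, a single segment can be streamed over HTTP and processed record by record, so nothing close to the full dump ever has to sit on your disk. In this sketch the URL is a placeholder; the real segment paths are published in the path listings that accompany each crawl.

```python
import requests
from warcio.archiveiterator import ArchiveIterator

# Placeholder; substitute a real path from the crawl's segment listing.
SEGMENT_URL = "https://data.commoncrawl.org/crawl-data/example.warc.gz"

mentions = 0
with requests.get(SEGMENT_URL, stream=True) as resp:
    # ArchiveIterator handles the gzip compression transparently.
    for record in ArchiveIterator(resp.raw):
        if record.rec_type == "response":
            text = record.content_stream().read().decode("utf-8", errors="ignore")
            mentions += text.count("Mike Saunders")

print("Mentions in this segment:", mentions)
```

Spreading segments across a cluster is then just a matter of running many of these workers in parallel, which is exactly what MapReduce does for you.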
Q A minute ago you were telling me this was easy, and now you're telling me that I need to use Hadoop! This doesn't sound easy at all.
A Well, you don't need to use Hadoop, but it can help. It also doesn't need to be hard. The Common Crawl team has put together a tool for launching MapReduce jobs that automatically links in the data. You just have to fill in the details and provide the Python code to process the data. You can see the examples online [2].
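The examples in that repository wrap the WARC handling in a small helper class; the hedged sketch below uses plain mrjob and treats each input line as page text, just to show the mapper/reducer shape you would fill in.

```python
from mrjob.job import MRJob

NAMES = ["Graham Morrison", "Mike Saunders", "Andrew Gregory"]

class MRCountMentions(MRJob):
    """Count name mentions; a plain-mrjob stand-in for a cc-mrjob job."""

    def mapper(self, _, line):
        # The real cc-mrjob examples feed WARC records in here rather
        # than plain lines of text.
        for name in NAMES:
            hits = line.count(name)
            if hits:
                yield name, hits

    def reducer(self, name, counts):
        yield name, sum(counts)

if __name__ == "__main__":
    MRCountMentions.run()
```

Saved as, say, mr_count_mentions.py, it runs locally with python mr_count_mentions.py input.txt, and mrjob takes care of wiring the mapper to the reducer.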
Q Well, yes, but you've neatly stepped over the part where you have to set up the Hadoop cluster first.
A Yes, it does need a Hadoop cluster, but this doesn't have to be a pain to set up either. If you don't have machines to run it on, you can spin up machines in the cloud and run it there. The Common Crawl data is hosted in Amazon's S3 storage, and you can access it from Amazon's own cloud machines without paying for bandwidth. Using Elastic MapReduce and the cc-mrjob tool, you can automatically spin up a Hadoop cluster on cheap spot instances and process the data with just a single command.
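As a hedged sketch, assuming the job above was saved as mr_count_mentions.py, mrjob can launch it on Elastic MapReduce from Python; your AWS credentials, cluster size, and any spot-instance pricing would come from mrjob's configuration file rather than the code.

```python
from mr_count_mentions import MRCountMentions  # the sketch from above

# "-r emr" asks mrjob to run the job on Elastic MapReduce; the S3 input
# path is a placeholder. Results land in the job's output directory.
job = MRCountMentions(args=["-r", "emr", "s3://example-bucket/input/"])
with job.make_runner() as runner:
    runner.run()
```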
Q Ok, that doesn't sound too bad, but is there anything I can do with this data without creating clusters of machines to download terabytes of data?
A As it happens, yes. The Web Data Commons project [3] analyzes the Common Crawl dataset and pulls out some useful information (Figure 2). The result is smaller, more manageable datasets that don't contain as much data. For example, there's a 700MB dataset of all the locations linked from web pages, one of all the calendar events on the web (2GB), one of all the reviews (3GB), and more. Head to the Web Data Commons website to download the files.
Infos
- Common Crawl: http://www.commoncrawl.org
- MapReduce examples: https://github.com/commoncrawl/cc-mrjob
- Web Data Commons: http://webdatacommons.org/