Common Crawl
FAQ
Download the entire web to kick-start a data science empire.
Q Is this some new swimming stroke that's all the rage?
A Is that really the best guess you can come up with? The Common Crawl project [1] scrapes the web, sucking up as much information as possible, and makes this data available for anyone who wants to use it. Data is released approximately every month and goes back to 2007.
Q They scrape the web for pages that are accessible to the public and make this data available to the public? What exactly is this meant to achieve?
[...]
Buy this article as PDF
(incl. VAT)
Buy Linux Magazine
Subscribe to our Linux Newsletters
Find Linux and Open Source Jobs
Subscribe to our ADMIN Newsletters
Support Our Work
Linux Magazine content is made possible with support from readers like you. Please consider contributing when you’ve found an article to be beneficial.

News
-
Linux Hits an Important Milestone
If you pay attention to the news in the Linux-sphere, you've probably heard that the open source operating system recently crashed through a ceiling no one thought possible.
-
Plasma Bigscreen Returns
A developer discovered that the Plasma Bigscreen feature had been sitting untouched, so he decided to do something about it.
-
CachyOS Now Lets Users Choose Their Shell
Imagine getting the opportunity to select which shell you want during the installation of your favorite Linux distribution. That's now a thing.
-
Wayland 1.24 Released with Fixes and New Features
Wayland continues to move forward, while X11 slowly vanishes into the shadows, and the latest release includes plenty of improvements.
-
Bugs Found in sudo
Two critical flaws allow users to gain access to root privileges.
-
Fedora Continues 32-Bit Support
In a move that should come as a relief to some portions of the Linux community, Fedora will continue supporting 32-bit architecture.
-
Linux Kernel 6.17 Drops bcachefs
After a clash over some late fixes and disagreements between bcachefs's lead developer and Linus Torvalds, bachefs is out.
-
ONLYOFFICE v9 Embraces AI
Like nearly all office suites on the market (except LibreOffice), ONLYOFFICE has decided to go the AI route.
-
Two Local Privilege Escalation Flaws Discovered in Linux
Qualys researchers have discovered two local privilege escalation vulnerabilities that allow hackers to gain root privileges on major Linux distributions.
-
New TUXEDO InfinityBook Pro Powered by AMD Ryzen AI 300
The TUXEDO InfinityBook Pro 14 Gen10 offers serious power that is ready for your business, development, or entertainment needs.