A Bash DIY data extraction tool

If your research involves pulling large amounts of text data from the Internet, you can gather and process that data from the command line with a few simple Bash commands and turn it into a CSV file for your favorite statistical application, such as SPSS, R, or a MySQL table. In this article, I will show how to accomplish this with a project that examines the Romanian university dropout rate.

The data I need comes from 97 universities. For confidentiality reasons, chances are slim that I can get access to each university's database, but I can obtain that information legally from their website. (However keep in mind that many websites have licenses that prohibit web scraping. This article does not attempt to address copyright and other legal issues related to this practice. See the site's permission page and consult the applicable laws for your jurisdiction.) To gather my data, I could search for the word abandon (Romanian for dropout) on each of the 97 websites, but that would be tedious. Furthermore, each website may use a different content management system (CMS), so my search might not return the desired results. Instead, an easier option is to download all 97 websites in their entirety and recursively search their text content on my local hard drive. Linux lets you do this with the command shown in Listing 1.

Retrieving Data

In Listing 1, wget is a command-line utility in Linux and other POSIX-compliant operating systems used to download files from servers. It can be used as a mass downloader, and you can specify exactly which type of files you want downloaded and which type of files wget should disregard.

[...]

Use Express-Checkout link below to read the full article (PDF).

Buy this article as PDF

Express-Checkout as PDF

Price $2.95
(incl. VAT)

Buy Linux Magazine

SINGLE ISSUES

Print Issues

Digital Issues

SUBSCRIPTIONS

Print Subs

Digisubs

TABLET & SMARTPHONE APPS

US / Canada

UK / Australia

Support Our Work

Linux Magazine content is made possible with support from readers like you. Please consider contributing when you’ve found an article to be beneficial.

News

Fedora Continues 32-Bit Support

Fedora , Games , Linux

In a move that should come as a relief to some portions of the Linux community, Fedora will continue supporting 32-bit architecture.
Linux Kernel 6.17 Drops bcachefs

Filesystem , Kernel , Linux

After a clash over some late fixes and disagreements between bcachefs's lead developer and Linus Torvalds, bachefs is out.
ONLYOFFICE v9 Embraces AI

Artificial Inte... , open source , OpenOffice

Like nearly all office suites on the market (except LibreOffice), ONLYOFFICE has decided to go the AI route.
Two Local Privilege Escalation Flaws Discovered in Linux

Kernel , Linux , Security

Qualys researchers have discovered two local privilege escalation vulnerabilities that allow hackers to gain root privileges on major Linux distributions.
New TUXEDO InfinityBook Pro Powered by AMD Ryzen AI 300

Hardware , Linux , Notebook

The TUXEDO InfinityBook Pro 14 Gen10 offers serious power that is ready for your business, development, or entertainment needs.
Danish Ministry of Digital Affairs Transitions to Linux

LibreOffice , Linux , Windows

Another major organization has decided to kick Microsoft Windows and Office to the curb in favor of Linux.
Linux Mint 20 Reaches EOL

With Linux Mint 20 at its end of life, the time has arrived to upgrade to Linux Mint 22.
TuxCare Announces Support for AlmaLinux 9.2

AlmaLinux , Enterprise Linux , Security

Thanks to TuxCare, AlmaLinux 9.2 (and soon version 9.6) now enjoys years of ongoing patching and compliance.
Go-Based Botnet Attacking IoT Devices

IoT , Security , Systemd

Using an SSH credential brute-force attack, the Go-based PumaBot is exploiting IoT devices everywhere.
Plasma 6.5 Promises Better Memory Optimization

Desktop , Linux , Plasma

With the stable Plasma 6.4 on the horizon, KDE has a few new tricks up its sleeve for Plasma 6.5.

A Bash DIY data extraction tool

Retrieving Data

Buy this article as PDF

Buy Linux Magazine

Related content

Subscribe to our Linux Newsletters
Find Linux and Open Source Jobs
Subscribe to our ADMIN Newsletters

Support Our Work

News

Fedora Continues 32-Bit Support

Linux Kernel 6.17 Drops bcachefs

ONLYOFFICE v9 Embraces AI

Two Local Privilege Escalation Flaws Discovered in Linux

New TUXEDO InfinityBook Pro Powered by AMD Ryzen AI 300

Danish Ministry of Digital Affairs Transitions to Linux

Linux Mint 20 Reaches EOL

TuxCare Announces Support for AlmaLinux 9.2

Go-Based Botnet Attacking IoT Devices

Plasma 6.5 Promises Better Memory Optimization

A Bash DIY data extraction tool

Retrieving Data

Buy this article as PDF

Buy Linux Magazine

Related content

Subscribe to our Linux Newsletters Find Linux and Open Source Jobs Subscribe to our ADMIN Newsletters

Support Our Work

News

Tag Cloud

Subscribe to our Linux Newsletters
Find Linux and Open Source Jobs
Subscribe to our ADMIN Newsletters