A Bash DIY data extraction tool

If your research involves pulling large amounts of text data from the Internet, you can gather and process that data from the command line with a few simple Bash commands and turn it into a CSV file that you can import into your favorite statistical application, such as SPSS or R, or into a MySQL table. In this article, I will show how to accomplish this with a project that examines the Romanian university dropout rate.

The data I need comes from 97 universities. For confidentiality reasons, chances are slim that I can get access to each university's database, but I can obtain that information legally from each university's website. (However, keep in mind that many websites have licenses that prohibit web scraping. This article does not attempt to address copyright and other legal issues related to this practice. See the site's permission page and consult the applicable laws for your jurisdiction.) To gather my data, I could search for the word "abandon" (Romanian for dropout) on each of the 97 websites, but that would be tedious. Furthermore, each website may use a different content management system (CMS), so my search might not return the desired results. Instead, an easier option is to download all 97 websites in their entirety and recursively search their text content on my local hard drive. Linux lets you do this with the command shown in Listing 1.
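The local search step can be sketched as follows. This is not the article's Listing 1; it is a minimal illustration that assumes the downloaded sites sit under one directory with one subdirectory per site (the layout wget's recursive mode creates by default). The function name count_term and the CSV column names are hypothetical.

```shell
#!/usr/bin/env bash
# count_term ROOT TERM
# Hypothetical helper: for every site directory below ROOT, count the lines
# that contain TERM (case-insensitive) and print the result as CSV.
count_term() {
  local root=$1 term=$2 site count
  echo "site,matching_lines"
  for site in "$root"/*/; do
    # Skip the unexpanded pattern when ROOT has no subdirectories
    [ -d "$site" ] || continue
    # -r: search recursively, -i: ignore case, -I: skip binary files
    count=$(grep -r -i -I -- "$term" "$site" 2>/dev/null | wc -l)
    site=${site%/}
    # $((count)) strips the padding some wc implementations add
    echo "${site##*/},$((count))"
  done
}
```

Calling count_term sites abandon > abandon_counts.csv would write one row per site, assuming the downloads live in a directory named sites. Counting matching lines is only a rough first pass; the matches still need to be read in context.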

Retrieving Data

The wget command used in Listing 1 is a command-line utility for Linux and other POSIX-compliant operating systems that downloads files from servers. It can be used as a mass downloader, and you can specify exactly which types of files you want downloaded and which types wget should disregard.
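As a rough sketch of how wget's accept and reject lists work (again, not the article's Listing 1), the function below mirrors every URL listed in a text file. The function name mirror_sites, the WGET override variable, the recursion depth, the one-second wait, and the file extensions are all assumptions chosen for illustration.

```shell
#!/usr/bin/env bash
# mirror_sites URL_FILE DEST_DIR
# Hypothetical helper: download each site listed in URL_FILE (one URL per
# line) into DEST_DIR, keeping text pages and skipping images and PDFs.
# Set WGET=echo for a dry run that prints the options instead of downloading.
mirror_sites() {
  local url_file=$1 dest=$2 url
  while IFS= read -r url; do
    # Skip blank lines and comment lines in the URL list
    case $url in ''|'#'*) continue ;; esac
    "${WGET:-wget}" --recursive --level=3 --no-parent --wait=1 \
      --accept=html,htm,txt --reject=jpg,jpeg,png,gif,pdf \
      --directory-prefix="$dest" "$url"
  done < "$url_file"
}
```

With a file urls.txt that holds one address per line, mirror_sites urls.txt sites would place each site in its own subdirectory of sites. The --wait option pauses between requests to reduce the load on the servers.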

[...]
